Coronavirus Molecule

(“Image Library” 2020)

1 Introduction

1.1 Covid-19

COVID-19, a new strain of coronavirus, appeared on the scene in early December of 2019 in Wuhan, China. The new strain spread into the surrounding regions of China and eventually throughout the world. As of today, COVID-19 has been declared a pandemic.

Coronaviruses are common in animals and rarely infect humans; however, this new strain emerged, whether naturally or not is unknown at this time, with the capability of infecting humans.

COVID-19 has been linked to the family of respiratory diseases, such as influenza and pneumonia; however, it has been stressed that COVID-19 is deadlier than these diseases (Sauer 2020).

As of now, the Centers for Disease Control and Prevention (CDC) has kept a very exhaustive record of the virus’ damage in the United States. The CDC maintains a webpage, updated weekly, containing several tables with the number of deaths from COVID-19 categorized by age, sex, underlying health conditions, and other factors (“Weekly Updates by Select Demographic and Geographic Characteristics” 2020). In addition, another webpage lists each US state along with its total COVID-19 deaths and cases thus far. In this analysis, we will seek to explain COVID-19 deaths per state as a function of selected variables.

1.1.1 NOTE

All COVID-19 data used in this analysis dates from 2/1/2020 to 10/31/2020 (“Daily Updates of Totals by Week and State” 2020).

1.2 Constructing the Dataset

In continuation of the data from the first webpage, Table 3 details “conditions contributing to deaths involving coronavirus disease” (“Weekly Updates by Select Demographic and Geographic Characteristics” 2020). The table details several sections, namely respiratory diseases, circulatory diseases (cardiac arrest, heart failure, etc.), sepsis, obesity, diabetes, and others as major underlying health concerns and comorbidities correlated with COVID-19 deaths. Therefore, we may hypothesize that COVID-19 deaths are a function of these underlying health conditions. The number of morbidities from Chronic Lower Respiratory Disease, Kidney Disease, Cancer, Hypertension, Diabetes, Influenza, Heart Disease, and Obesity per state were collected from the CDC (“Stats of the States,” n.d.).

In addition, we will add other hypothesized variables that intuitively support higher COVID-19 morbidity rates: population density (“Population Density in the U.s. By Federal States Including the District of Columbia in 2019” 2019), number of persons over the age of 65 (from 2019 data (“Population Distribution by Age” 2019)), and number of cases (“Stats of the States,” n.d.). We hypothesize that states with higher population density may see more cases, as density closes the proximity between healthy and sick persons, thus leading to an increase in the COVID-19 morbidity rate. It can be shown from the CDC website that approximately 78.9% of all US COVID-19 deaths belonged to those over the age of 65; thus, states with older populations might have higher COVID-19 morbidity rates. Lastly, we hypothesize that states with a higher number of cases will have higher morbidity rates than other states (“Stats of the States,” n.d.).

1.3 Variables and Data

The dataset has 50 observations, one for each state, consisting of the number of COVID-19 cases and deaths; morbidity cases of those who died from Diabetes, Hypertension, Cancer, Kidney Disease, and Chronic Lower Respiratory Disease; the number of obese persons; the number of persons over the age of 65; and population density (number of people per square mile) per state. However, to account for differences in state population size, each of these variables, except population density, was divided by the state’s population to get the appropriate percentages.

Population Density should be interpreted as the number of people per square mile. Obesity Rate, Diabetes Rate, and 65 and Older Rate should be interpreted as the percentage of the state’s population that is Obese, has Diabetes, or is Older than 65. Covid Case Rate should be interpreted as the percentage of the state’s population that has tested COVID-19 positive. The rest of the variables, Covid Death Rate, Hypertension Rate, Kidney Disease Rate, and Cancer Rate, should be interpreted as the percentage of the state’s population that has died from the respective condition (Note: these were collected on a yearly basis; the non-covid variables were collected from a 2018 dataset).

For example, Alabama (AL) has a death rate of 0.0606% of its entire population for COVID-19, a death rate of 0.2168% for Kidney Disease, and a positive COVID-19 case rate of 3.974%; on the other hand, Alabama has a population density of about 97 persons per square mile and an Obesity rate of 36.1% of its entire population.

# Save the state names, convert the remaining columns to percentages,
# then restore Population Density (column 12) to its raw scale
STATES = data$STATE
data = 100*data[,-1]
data[,12] = data[,12]/100
rownames(data) = STATES
paged_table(data)

1.4 Examining the Data

Now that the dataset has been constructed, multivariate procedures can be applied to derive information before continuing with the regression analysis.

1.4.1 Outliers

Outliers can be found by calculating the squared distances of each data observation using the following equation: \((X-\mu)'\Sigma^{-1}(X-\mu)\), where \(\mu\) is the mean vector of the variables, \(\Sigma\) is the covariance matrix of the dataset, and \(X\) is the dataset. It can be shown that any squared distance greater than a \(1-\alpha/2\) Chi-Square quantile with degrees of freedom equal to the number of variables is a suspected outlier. Here are the squared distances with a red line denoting the \(0.975\) Chi-Square quantile:

# Drop column 11 before computing the distances
tdata = data[,-11]
s = solve(cov(tdata))                  # inverse covariance matrix
meanVector = colMeans(tdata)           # mean vector of the retained variables
sqd = matrix(0, ncol=2, nrow=50)
for(i in 1:nrow(tdata)) {
  x = as.matrix(tdata[i,]-meanVector)
  sqd[i,2] = x%*%s%*%t(x)              # squared distance (x-mu)' S^{-1} (x-mu)
  sqd[i,1] = i
}
sqd = data.frame(sqd)
colnames(sqd) = c("Index", "Squared Value")
ggplot(data=sqd, aes(x=Index, y=`Squared Value`))+
  geom_point()+
  geom_hline(yintercept = qchisq(1-0.05/2, df=ncol(tdata)), color="red")+
  xlab("Data Row Index")+
  ggtitle("Outliers: Squared Distances with Chi-Square Quantile Cutoff")+
  theme(plot.background=element_rect(fill="grey"))

As we can see here, we have three suspected outliers. Their indices are 4, 11, and 48. Their z-score values are:

tdata = scale(data)                    # z-scores for each variable
paged_table(as.data.frame(cbind(STATES[c(4, 11, 48)],tdata[c(4, 11, 48),])[,-1]))

We can see that Arkansas has extremely low z-scores for both the Diabetes and Kidney Disease Rates. Hawaii has a large z-score for the Influenza/Pneumonia Death Rate. West Virginia has large z-scores for the Diabetes and Chronic Lower Respiratory Disease Rates.
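For readers outside R, the squared-distance screen above can be sketched in Python with NumPy and SciPy. This is an illustrative stand-in on simulated data, not the state dataset itself:

```python
import numpy as np
from scipy.stats import chi2

# Simulated stand-in for the 50-observation dataset: 50 rows, 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

mu = X.mean(axis=0)                              # mean vector
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix

# squared distance (x - mu)' S^{-1} (x - mu) for every observation
d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)

# flag rows above the 1 - alpha/2 chi-square quantile, df = number of variables
cutoff = chi2.ppf(1 - 0.05 / 2, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
```

A useful sanity check on this computation: with the sample mean and covariance, the squared distances always sum to \((n-1)p\).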

1.4.2 Cluster Analysis

1.4.2.1 Dendrogram

Here is the cluster dendrogram based on the data observations, the states. In a typical cluster analysis, one would choose two clusters for this dataset; however, due to the extremely distant merge point between the two main clusters, the right sub-cluster can be broken down into its own clusters. Ward’s method was used to calculate the distances in this cluster analysis.

tempDF = orig[,-1]
rownames(tempDF) = orig[,1]
tempDF %>%
  scale() %>%
  dist() %>%
  hclust(method="ward.D2") %>%
  fviz_dend(cex=0.5, k=4)

The leftmost, most distant cluster contains states such as Arkansas, Alaska, Texas, Utah, and Colorado. The leftmost cluster of the second major sub-cluster merges at extremely small distances: New York, New Jersey, Rhode Island, Massachusetts, and Connecticut. The other two clusters pertain to the rest of the states.

1.4.2.2 Heatmap

Here is a heatmap of each state crossed with the variables’ scaled values. On the top, the same cluster analysis is used as the one previously detailed in the dendrogram section. On the left side, a new cluster analysis based upon the variables is used. From this, one can see that there exist three main clusters: one for COVID-19 Death Rate and Population Density; a second for Covid Case Rate and Obesity Rate; and lastly one for the rest of the variables (the majority of underlying health conditions).

tempDF %>%
  scale() %>%
  t() %>%
  pheatmap(cutree_cols = 13)

1.4.3 Principal Component Analysis

Principal Component Analysis seeks to explain the variance-covariance structure of a dataset by representing it with a set of orthogonal vectors, namely the eigenvectors. More behind this concept will be explained later.

Here we have the scree plot (the proportion of each eigenvalue to the sum of the eigenvalues). Components with higher eigenvalues contribute more variance explanatory power to the model.

pc = prcomp(tempDF, scale=TRUE)
fviz_eig(pc)

From the plot one can see that this dataset contains one major component and one to two smaller components.

The biplot of the first two components can be overlaid on the component scores (the original dataset multiplied by the first two principal components):

ggbiplot(pc)

From the plot, principal component one explains approximately 48.2% of the variability in the dataset, while the second component only explains 16.8%. As one can see, COVID-19 Death Rate and Population Density are heavily influenced by the second principal component; note that this agrees with the cluster analysis performed earlier. Kidney Disease, Diabetes, and Obesity Rate load negatively on the first component and positively on the second. The rest of the variables seem to be centered about zero on the second component but load negatively on the first.
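The scree computation itself is direct: a minimal NumPy sketch on simulated data (not the state dataset) shows how the proportions come from the eigenvalues of the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 5))
Z[:, 1] += Z[:, 0]                        # make two variables correlated

R = np.corrcoef(Z, rowvar=False)          # PCA on standardized data
eigvals = np.linalg.eigvalsh(R)[::-1]     # eigenvalues, largest first
prop = eigvals / eigvals.sum()            # scree-plot proportions

# component scores: standardized data times the eigenvectors
_, vecs = np.linalg.eigh(R)
Zs = (Z - Z.mean(0)) / Z.std(0)
scores = Zs @ vecs[:, ::-1]
```

Since the trace of a correlation matrix equals the number of variables, the proportions always sum to one.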

2 Theory behind Multiple Linear Regression

In multiple linear regression, one is trying to relate a single variable, \(y\), to a matrix of independent \(x\) variables. Thus the following model is used:

\[ Y=X\beta+\epsilon \]

2.1 \(\beta\) Coefficients

From the model, \(\beta\) can be estimated by the geometry of least squares. The model supposes that \(E(y)\) is a linear combination of the variables of \(X\); thus, \(X\beta\) lies in the column space of \(X\). In addition, the residual, \(\hat \epsilon\), used to make our model exact is simply the difference between \(Y\) and \(\hat Y\); that is, \(\hat \epsilon=y-\hat y\). To derive \(\hat\beta\),

\[ \begin{eqnarray} Y&=&X\hat \beta +\hat \epsilon\\ Y-X\hat \beta -\hat \epsilon&=& 0\\ X'(Y-X\hat \beta-\hat \epsilon)& =& 0 \\ X'Y - X'X\hat \beta -X'\hat \epsilon&=& 0 \\ X'X \hat \beta&=& X'Y \ \ \ \ \ \ \ \ X' \text{ and } \hat \epsilon \text{ are orthogonal} \\ (X'X)^{-1}X'X\hat \beta&=&(X'X)^{-1}X'Y \\ I\hat \beta &=& (X'X)^{-1}X'Y \\ \hat \beta &=& (X'X)^{-1}X'Y \\ \end{eqnarray} \]

However, it is important to note that this derivation presupposes the existence of \((X'X)^{-1}\); therefore, the columns of \(X\) must be linearly independent.
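The closed form can be checked numerically; a sketch on simulated data (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'y -- solve the normal equations directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# agrees with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the normal equations with `solve` avoids explicitly forming \((X'X)^{-1}\), which is numerically preferable.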

2.2 Q Theorem

This fundamental theorem will help simplify the derivations further along in this analysis. The Q theorem states the following:

Let \(Q=V'AV\) where \(V\) is a vector distributed \(N_n(0, I_n)\) and \(A\) is idempotent, \(A'=A\) and \(A^2=A\); then \(Q\sim\chi^2_r\) where \(r=Tr(A)\). Before one can prove this, a few asides must be stated.

2.2.1 Asides

  1. Spectral Decomposition: If \(A\) is a \(k\text{ x } k\) symmetric matrix, then \(A\) can be represented in terms of its eigenvalue (\(\lambda_i\)) eigenvector (\(e_i\)) pairs: \(A=\sum^k\lambda_ie_ie_i'\)

  2. A Chi-square distribution is defined as the sum of squared values distributed \(N(0, 1)\)

  3. If \(A\) is idempotent, \(A^2=A\), then its eigenvalues are either 0 or 1
    • Eigenvalues and eigenvectors are defined as such: \(Ae=\lambda e\)
    • Then: \[ \begin{eqnarray} AAe&=&A\lambda e \\ A^2e&=& \lambda Ae \\ Ae &=& \lambda \lambda e \\ Ae &=& \lambda^2 e \\ 0 &=& \lambda^2 e - Ae \\ 0 &=& \lambda^2 e - \lambda e \\ 0 &=& \lambda (\lambda - 1)e \\ \end{eqnarray} \]
    • Therefore, \(\lambda\in \{0,1\}\)
  4. If \(A\) is symmetric, then \(Tr(A)=\sum\lambda_i\)
    • \(Tr(A)=Tr(\sum^k\lambda_ie_ie_i')=Tr(T\Lambda T')\) where \(T=[e_1,\dots,e_k]\) and \(\Lambda=diag(\lambda_1,\dots,\lambda_k)\).
    • \(Tr(T\Lambda T')=Tr(\Lambda T'T)\) by the cyclic property of traces, and \(Tr(\Lambda T'T)=Tr(\Lambda)\) because the \(e_i\) are orthonormal, so \(T'T=I\).
    • \(Tr(A)=Tr(\Lambda)=\sum\lambda_i\)

2.2.2 Main Proof

\[ \begin{eqnarray} Q&=&V'AV \\ &=& V'(\sum^k\lambda_ie_ie_i')V \ \ \text{by Aside 1} \\ &=& \sum^k\lambda_iV'e_ie_i'V \\ \end{eqnarray} \]

Let \(Z_i=e_i'V\) where \(E(Z_i)=E(e_i'V)=e_i'E(V)=e_i'0=0\) and \(V(Z_i)=V(e_i'V)=e_i'Cov(V)e_i=e_i'Ie_i=1\) by the \(e_i\) being orthonormal. Therefore, \(Z_i\sim N(0, 1)\). In addition, \(Z_i=e_i'V=V'e_i=Z_i'\); therefore, \(Z_i^2=V'e_ie_i'V\).

\[ \begin{eqnarray} &=& \sum^k\lambda_iV'e_ie_i'V \\ &=& \sum^k\lambda_i Z_i^2 \\ &=& \chi^2_{r} \ \ \ \text{By Aside 2} \end{eqnarray} \]

As one can see, \(Q\) is actually Chi-square distributed with some \(r\) degrees of freedom, the number of nonzero eigenvalues. However, due to the third aside, all the eigenvalues of an idempotent matrix are either 0 or 1, where the number of eigenvalues equal to 1 equals the trace of the matrix, by Aside Four. Therefore, \(r=Tr(A)\).

This theorem will be used heavily in the near future.

2.3 \(S^2\)

We will use \(s^2\) as an unbiased estimator for \(\sigma^2\). That is, \(s^2\) represents the residual variance. This value can be derived through \(RSS\) and \(TSS\).

2.3.1 RSS

The Residual Sum of Squares, \(RSS\), is defined as \(\sum(y_i-\hat y_i)^2\), which equals \((Y-\hat Y)'(Y-\hat Y)\) in matrix notation. As stated before in the geometry of least squares, \(\hat \epsilon=y-\hat y\); therefore, \(RSS=\hat \epsilon'\hat \epsilon\). One can express \(\hat Y\) in terms of a projection matrix \(P\), \(\hat Y=PY\), so that \(\hat \epsilon=(I-P)Y\).

\(P\) can be derived by altering the following equation:

\[ \begin{eqnarray} PY &=& X\hat \beta \\ &=& X(X'X)^{-1}X'Y \end{eqnarray} \]

Since this holds for every \(Y\), it follows that \(P =X(X'X)^{-1}X'\).

  • As an aside, the statement that \(P=X(X'X)^{-1}X'\) is an idempotent matrix is asserted. Proof:
    • \(P'=P\) \[ \begin{eqnarray} P'&=&(X(X'X)^{-1}X')' \\ &=& X''((X'X)^{-1})'X' \\ &=& X((X'X)')^{-1}X' \\ &=& X(X'X)^{-1}X' \\ &=& P \end{eqnarray} \]
    • \(P^2=P\) \[ \begin{eqnarray} P^2&=&(X(X'X)^{-1}X')(X(X'X)^{-1}X') \\ &=& (X(X'X)^{-1})(X'X(X'X)^{-1})(X') \\ &=& (X(X'X)^{-1})I(X') \\ &=& (X(X'X)^{-1})(X') \\ &=& P \end{eqnarray} \]

Therefore:

\[ \begin{eqnarray} RSS &=& \hat \epsilon'\hat \epsilon \\ &=& [(I-P)Y]'(I-P)Y \end{eqnarray} \]

But,

\[ \begin{eqnarray} \hat \epsilon&=&(I-P)Y \\ &=& (I-P)(X\beta+\epsilon) \\ &=& X\beta-PX\beta+(I-P)\epsilon \\ &=& IX\beta - X(X'X)^{-1}X'X\beta+(I-P)\epsilon \\ &=& X\beta - XI\beta+(I-P)\epsilon \\ &=& X\beta - X\beta+(I-P)\epsilon \\ &=& (I-P)\epsilon \\ \end{eqnarray} \]

Then,

\[ \begin{eqnarray} RSS &=& \hat \epsilon'\hat \epsilon \\ &=& [(I-P)\epsilon]'(I-P)\epsilon \\ &=& \epsilon'(I-P)'(I-P)\epsilon \end{eqnarray} \]

But \(I-P\) is idempotent as both \(I\) and \(P\) are idempotent. Now \(RSS\) is divided by \(\sigma^2\):

\[ \begin{eqnarray} RSS &=& \epsilon'(I-P)'(I-P)\epsilon \\ &=& \epsilon'(I-P)(I-P)\epsilon \\ &=& \epsilon'(I-P)^2\epsilon \\ &=& \epsilon'(I-P)\epsilon \\ \frac{RSS}{\sigma^2} &=&\frac{\epsilon'}{\sigma}(I-P)\frac{\epsilon}{\sigma} \end{eqnarray} \]

Now the Q theorem derived earlier will be used, but an extremely important assumption is presumed: \(\epsilon\sim N_n(0, \sigma^2I_n)\). Let \(V=\frac{\epsilon}{\sigma}\) so that

\[ \begin{eqnarray} \frac{RSS}{\sigma^2} &=&\frac{\epsilon'}{\sigma}(I-P)\frac{\epsilon}{\sigma} \\ &=& V'(I-P)V \end{eqnarray} \]

Where \(E(V)=E(\frac{\epsilon}{\sigma})=0\) and \(Cov(V)= Cov(\frac{\epsilon}{\sigma})=\frac{1}{\sigma^2}V(\epsilon)=\frac{\sigma^2I}{\sigma^2}=I\); therefore, \(V\sim N(0, I_n)\). Now the Q theorem will be applied to state

\[ \frac{RSS}{\sigma^2}=\chi^2_r \]

Where:

\[ \begin{eqnarray} r&=& Tr(A) \\ &=& Tr(I-P) \\ &=& Tr(I)-Tr(P) \\ &=& n- Tr(X(X'X)^{-1}X') \\ &=& n- Tr(X'X(X'X)^{-1}) \\ &=& n- Tr(I_{k+1}) \\ &=& n- (k+1) \end{eqnarray} \]

Where \(k+1\) appears because \(X\) is an \(n\text{ x }(k+1)\) matrix (one intercept column plus \(k\) predictors), so \(X'X\) and its inverse are \((k+1)\text{ x }(k+1)\), giving \(Tr(I_{k+1})=k+1\).
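Every property used in this derivation, symmetry, idempotency, and the trace identity, can be confirmed numerically on a small simulated design (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k predictors
P = X @ np.linalg.inv(X.T @ X) @ X.T                        # hat matrix P

sym_ok = np.allclose(P, P.T)                                # P' = P
idem_ok = np.allclose(P @ P, P)                             # P^2 = P
tr_P = np.trace(P)                                          # should be k + 1
tr_IP = np.trace(np.eye(n) - P)                             # should be n - (k + 1)

# RSS two ways: sum of squared residuals vs the quadratic form y'(I-P)y
y = rng.normal(size=n)
resid = y - P @ y
rss_direct = resid @ resid
rss_quadform = y @ (np.eye(n) - P) @ y
```

The trace of \(I-P\) is exactly the residual degrees of freedom, \(n-(k+1)\).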

2.3.2 TSS

The Total Sum of Squares, \(TSS\), is defined as \(\sum(y_i-\bar y)^2\), which equals \((y-\bar y)'(y-\bar y)\) in matrix notation.

\[ \bar y = \frac{\sum y}n= \frac{1}{n} \begin{bmatrix} 1 & \dots & 1 \\ \vdots &\ddots & \vdots \\ 1 & \dots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \\ \]

For ease of notation, let \(J_n\) be the \(n\text{ x }n\) symmetric matrix of ones; then \(\bar y= J_nY/n\).

\[ \begin{eqnarray} TSS &=& (y-\bar y)'(y-\bar y) \\ &=& (y-J_nY/n)'(y-J_nY/n) \\ &=& [(I-\frac{J_n}{n})Y]'(I-\frac{J_n}{n})Y \\ &=& Y'(I-\frac{J_n}{n})'(I-\frac{J_n}{n})Y \end{eqnarray} \]

  • As an aside, \((I-\frac{J_n}{n})\) is asserted to be an idempotent projection matrix. Proof:
    • \((I-\frac{J_n}{n})'=I'-J_n'/n=I-\frac{J_n}{n}\), therefore symmetric.
    • \((I-\frac{J_n}{n})^2=I^2-\frac{2}{n}J_n+\frac{1}{n^2}J_n^2\), but \[ J_n^2=\begin{bmatrix}n&\dots &n\\ \vdots &\ddots & \vdots \\ n & \dots & n \end{bmatrix} \]
    • Then \(\frac{1}{n^2}J_n^2=\) \[ \frac{1}{n^2}\begin{bmatrix}n&\dots &n\\ \vdots &\ddots & \vdots \\ n & \dots & n \end{bmatrix}=\frac{1}{n}J_n \]
    • Therefore, \(I^2-\frac{2}{n}J_n+\frac{1}{n^2}J_n^2=I-\frac{2}{n}J_n+\frac{1}{n}J_n=I-\frac{J_n}{n}\)

Because \(I-\frac{J_n}{n}\) is an idempotent matrix:

\[ \begin{eqnarray} TSS &=& Y'(I-\frac{J_n}{n})'(I-\frac{J_n}{n})Y \\ &=& Y'(I-\frac{J_n}{n})(I-\frac{J_n}{n})Y \\ &=& Y'(I-\frac{J_n}{n})^2Y \\ &=& Y'(I-\frac{J_n}{n})Y \\ \end{eqnarray} \]

Now, \(Y\) is transformed using the z-score \(z=\frac{y-\mu}{\sigma}\Rightarrow y=\sigma z+\mu\), where \(Z\sim N(0, I)\). Note that the mean vector \(\mu\) has constant entries, so \((I-\frac{J_n}{n})\mu=0\) and the \(\mu\) terms vanish:

\[ \begin{eqnarray} TSS &=& Y'(I-\frac{J_n}{n})Y \\ &=& (\mu+\sigma Z)'(I-\frac{J_n}{n})(\mu+\sigma Z) \\ &=& (\sigma Z)'(I-\frac{J_n}{n})(\sigma Z) \\ &=& \sigma^2 Z'(I-\frac{J_n}{n}) Z \\ \end{eqnarray} \]

Now, \(TSS\) is divided by \(\sigma^2\) to get:

\[ \begin{eqnarray} \frac{TSS}{\sigma^2} &=& Z'(I-\frac{J_n}{n}) Z \\ \end{eqnarray} \]

Because \(I-\frac{J_n}{n}\) is idempotent and \(Z\sim N(0, 1)\) then the Q theorem is applied:

\[ \frac{TSS}{\sigma^2}=\chi^2_r \]

Where:

\[ \begin{eqnarray} r&=& Tr(A) \\ &=& Tr(I-\frac{J_n}{n}) \\ &=& Tr(I) - Tr(\frac{J_n}{n}) \\ &=& n - \frac{n}{n} \\ &=& n - 1 \end{eqnarray} \]
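The same kind of numeric check works for the centering matrix \(I-\frac{J_n}{n}\) (a small illustrative example):

```python
import numpy as np

n = 6
C = np.eye(n) - np.ones((n, n)) / n        # I - J_n / n
sym_ok = np.allclose(C, C.T)               # symmetric
idem_ok = np.allclose(C @ C, C)            # idempotent
tr_C = np.trace(C)                         # n - 1 degrees of freedom

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
# TSS as a quadratic form equals the usual sum of squared deviations
tss_quadform = y @ C @ y
tss_direct = ((y - y.mean()) ** 2).sum()
```

Centering removes one degree of freedom, which is exactly the \(n-1\) in the trace.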

2.3.3 Unbiased Estimator

An unbiased estimator is a statistic whose expected value equals the parameter it is trying to estimate. It was previously derived that \(\frac{RSS}{\sigma^2}=\chi^2_{n-(k+1)}\) and \(\frac{TSS}{\sigma^2} = \chi^2_{n-1}\).

2.3.3.1 RSS Unbiased

For \(\frac{RSS}{\sigma^2}=\chi^2_{n-(k+1)}\), we get the mean \(E(\chi^2_{n-(k+1)})=n-(k+1)\).

The statistic \(s^2 = \frac{RSS}{n-(k+1)}\) is asserted to be an unbiased estimator of \(\sigma^2\). To prove that this estimator is unbiased, its mean must equal the parameter of interest.

\[ \begin{eqnarray} s^2 &=& \frac{RSS}{n-(k+1)}, \ \ \ RSS = {\sigma^2}\chi^2_{n-(k+1)}\\ E(s^2) &=& E(\frac{RSS}{n-(k+1)}) \\ &=& E(\frac{\sigma^2 \chi^2_{n-(k+1)}}{n-(k+1)} ) \\ &=& \frac{\sigma^2}{n-(k+1)}E(\chi^2_{n-(k+1)} ) \\ &=& \frac{\sigma^2}{n-(k+1)}(n-(k+1)) \\ &=& \sigma^2 \\ \end{eqnarray} \]

Therefore, \(s^2\) is an unbiased estimator for \(\sigma^2\).
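Unbiasedness can also be seen in simulation: averaging \(s^2=RSS/(n-(k+1))\) over many replicated error draws recovers \(\sigma^2\). A sketch with arbitrary simulated values:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, sigma = 30, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, -0.5])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # I - P

s2_draws = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    rss = y @ M @ y                                # RSS = y'(I-P)y
    s2_draws.append(rss / (n - (k + 1)))           # s^2

mean_s2 = np.mean(s2_draws)                        # should be near sigma^2 = 4
```

Dividing by \(n\) or \(n-1\) instead of \(n-(k+1)\) would bias the average downward.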

2.3.3.2 TSS Unbiased

For \(\frac{TSS}{\sigma^2} = \chi^2_{n-1}\), we get the following mean \(E(\chi^2_{n-1}) = n-1\).

The statistic \(s^2 = \frac{TSS}{n-1}\) is asserted to be an unbiased estimator of \(\sigma^2\). To prove that this estimator is unbiased, its mean must equal the parameter of interest.

\[ \begin{eqnarray} s^2 &=& \frac{TSS}{n-1}, \ \ \ TSS = {\sigma^2}\chi^2_{n-1}\\ E(s^2) &=& E(\frac{TSS}{n-1}) \\ &=& E(\frac{\sigma^2 \chi^2_{n-1}}{n-1} ) \\ &=& \frac{\sigma^2}{n-1}E(\chi^2_{n-1} ) \\ &=& \frac{\sigma^2}{n-1}(n-1) \\ &=& \sigma^2 \\ \end{eqnarray} \]

Therefore, \(s^2\) is an unbiased estimator for \(\sigma^2\).

2.4 T Test Statistic

2.4.1 Intuition

Let \(l=a_0\beta_0+\dots+a_k\beta_k=a'\beta\); then an estimator for \(l\) is \(\hat l=a'\hat\beta\), whose mean is:

\[ \begin{eqnarray} E(\hat l)&=&E(a'\hat \beta) \\ &=& E(a'(X'X)^{-1}X'Y) \\ &=& a'(X'X)^{-1}X'E(Y) \\ &=& a'(X'X)^{-1}X'X\beta \\ &=& a'I\beta \\ &=& a'\beta \\ \end{eqnarray} \]

and its variance is:

\[ \begin{eqnarray} V(\hat l) &=& Cov(a' \hat \beta) \\ &=& a' Cov(\hat \beta) a \\ &=& a' Cov((X'X)^{-1}X'Y) a \\ &=& a' (X'X)^{-1}X'Cov(Y) [(X'X)^{-1}X']'a \\ &=& a' (X'X)^{-1}X'\sigma^2[(X'X)^{-1}X']'a \\ &=& a' (X'X)^{-1}X'\sigma^2X(X'X)^{-1}a \\ &=& \sigma^2 a' (X'X)^{-1}(X'X(X'X)^{-1})a \\ &=& \sigma^2 a' (X'X)^{-1}Ia \\ &=& \sigma^2 a' (X'X)^{-1}a \\ \end{eqnarray} \]

Because \(\hat l\) is a linear combination of \(\hat \beta\), then it is distributed Normally with \(N(a'\beta, \sigma^2 a' (X'X)^{-1}a)\). Now this can be converted to a z-score, \(z=(x-\mu)/\sigma\):

\[ Z=\frac{\hat l-a'\beta}{\sqrt{\sigma^2 a' (X'X)^{-1}a}} \]

However, because two population parameters are present, \(\beta\) and \(\sigma\), no inferences can be made. In response, one can convert this to a T statistic.

2.4.2 Conversion

A T statistic is given by the following general formula: \(\frac{Z}{\sqrt{\chi^2/v}}\), a z-score divided by the square root of the quotient between a Chi square distribution and its degrees of freedom. It was shown that \(\chi^2_{n-(k+1)}=\frac{RSS}{\sigma^2}\), with the unbiased estimator \(s^2=\frac{RSS}{n-(k+1)}\). Now the equations can be rearranged to solve for \(\chi^2_{n-(k+1)}\):

\[ s^2=\frac{RSS}{n-(k+1)} \text{ and } \sigma^2=\frac{RSS}{\chi^2_{n-(k+1)}} \\ (n-(k+1))s^2=\sigma^2\chi^2_{n-(k+1)} \\ \therefore \chi^2_{n-(k+1)}=\frac{(n-(k+1))s^2}{\sigma^2} \]

Now this Chi-square distribution can be applied to the T statistic with the z-score containing \(\hat l\) to derive:

\[ T=\frac{Z}{\sqrt{\chi^2/v}}=\frac{\frac{\hat l-a'\beta}{\sqrt{\sigma^2 a' (X'X)^{-1}a}}}{\sqrt{\frac{(n-(k+1))s^2}{\sigma^2}/(n-(k+1))}} \]

which simplifies down to: \(\frac{\hat l-a'\beta}{\sqrt{s^2 a' (X'X)^{-1}a}}\). Now because only one population parameter is present in the statistic, a confidence interval can be derived.

2.4.3 Confidence Intervals

Because the T statistic is a pivotal statistic, a confidence interval can be derived by using the Pivotal method:

\[ \begin{eqnarray} 1-\alpha&=&P(-t_{\alpha/2}\le T\le t_{\alpha/2}) \\ &=&P(-t_{\alpha/2}\le \frac{\hat l-a'\beta}{\sqrt{s^2 a' (X'X)^{-1}a}}\le t_{\alpha/2}) \\ &=&P(-\sqrt{s^2 a' (X'X)^{-1}a}\ t_{\alpha/2}\le \hat l-a'\beta\le \sqrt{s^2 a' (X'X)^{-1}a}\ t_{\alpha/2}) \\ &=&P(\hat l-\sqrt{s^2 a' (X'X)^{-1}a}\ t_{\alpha/2}\le a'\beta\le \hat l+\sqrt{s^2 a' (X'X)^{-1}a}\ t_{\alpha/2}) \end{eqnarray} \]

giving the interval \(\hat l \pm t_{\alpha/2}\sqrt{s^2 a' (X'X)^{-1}a}\).

After this derivation, one has the ability to create \((1-\alpha)100\)% confidence intervals for any linear combination of the \(\beta\)’s. This formula will be used to create confidence intervals for \(\beta\) coefficients and to create T test statistics for the coefficients. If creating T test statistics, \(\frac{\hat l-a'\beta}{\sqrt{s^2 a' (X'X)^{-1}a}}\) will be used, where \(a'\beta\) is the hypothesized value and \(\sqrt{s^2 a' (X'X)^{-1}a}\) is the standard error. After computing the T test statistic, \(T\sim t(n-(k+1))\), the associated tail probability can be calculated. If the associated p-value is greater than the chosen \(\alpha\) level, then there exists insufficient evidence to reject the NULL Hypothesis, \(\hat l=a'\beta\), and we conclude that it is plausible \(\hat l=a'\beta\); however, if the p-value is less than the chosen \(\alpha\) level, then there exists sufficient evidence to reject the NULL Hypothesis and conclude with confidence that \(\hat l \ne a'\beta\).
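These formulas translate directly to code; a sketch on simulated data (the coefficients and design here are illustrative, not the state model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - (k + 1))          # unbiased estimate of sigma^2

# test H0: a'beta = 0 where a picks out the first slope, beta_1
a = np.array([0.0, 1.0, 0.0])
se = np.sqrt(s2 * a @ XtX_inv @ a)          # sqrt(s^2 a'(X'X)^{-1} a)
t = a @ beta_hat / se

# (1 - alpha)100% confidence interval: l_hat +/- t_{alpha/2} * se
tcrit = stats.t.ppf(1 - 0.05 / 2, df=n - (k + 1))
ci = (a @ beta_hat - tcrit * se, a @ beta_hat + tcrit * se)
pval = 2 * (1 - stats.t.cdf(abs(t), df=n - (k + 1)))
```

With the true slope set to 2, the interval should comfortably bracket 2 and the t statistic should be large.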

2.5 F Test Statistic

2.5.1 Intuition

As stated before, the geometry of the least squares solution takes place in the column space of \(X\), \(\Omega\). Now let there exist a subspace of \(\Omega\), little \(\omega\). In this subspace \(y_\omega=\beta_0+\beta_1x_1+\dots+\beta_gx_g+\epsilon\); but for \(\Omega\), \(y_\Omega=\beta_0+\beta_1x_1+\dots+\beta_gx_g+\beta_{g+1}x_{g+1}+\dots+\beta_kx_k+\epsilon\). Therefore, there will be two \(RSS\) values, one for the subspace \(\omega\) and one for \(\Omega\). Let \(RSS\) denote the residual sum of squares for \(\Omega\) and \(RSS_{H_0}\) denote the residual sum of squares for \(\omega\).

Therefore,

\[ RSS = \epsilon'(I-P_\Omega)\epsilon \\ RSS_{H_0} = \epsilon'(I-P_\omega)\epsilon \\ \]

The difference of these two residual sum of squares can be taken to get:

\[ \begin{eqnarray} RSS_{H_0}-RSS&=& \epsilon'(I-P_\omega)\epsilon-\epsilon'(I-P_\Omega)\epsilon \\ &=& \epsilon'((I-P_\omega)-(I-P_\Omega))\epsilon \\ &=& \epsilon'(I-P_\omega-I+P_\Omega)\epsilon \\ &=& \epsilon'(P_\Omega-P_\omega)\epsilon \\ \end{eqnarray} \]

Now the division by \(\sigma^2\):

\[ \begin{eqnarray} \frac{RSS_{H_0}-RSS}{\sigma^2}&=& \frac{\epsilon'}{\sigma}(P_\Omega-P_\omega)\frac{\epsilon}{\sigma} \\ \end{eqnarray} \]

Let \(V=\frac{\epsilon}{\sigma}\), then \(E(V)=E(\frac{\epsilon}{\sigma})=0\) and \(V(\frac{\epsilon}{\sigma})=\frac{V(\epsilon)}{\sigma^2}=\frac{\sigma^2I}{\sigma^2}=I\) by the earlier assumption \(\epsilon\sim N(0, \sigma^2I)\). Therefore, \(V\sim N(0, I)\). Before the Q theorem can be applied, \((P_\Omega-P_\omega)\) must be proved to be idempotent.

  • As an aside, \((P_\Omega-P_\omega)\) is asserted to be idempotent. Proof:
    • \((P_\Omega-P_\omega)'=(P_\Omega-P_\omega)\): \[ \begin{eqnarray} (P_\Omega-P_\omega)'&=& P_\Omega'-P_\omega' = P_\Omega-P_\omega \end{eqnarray} \]
    • \((P_\Omega-P_\omega)^2=(P_\Omega-P_\omega)\): \[ \begin{eqnarray} (P_\Omega-P_\omega)^2 &=& (P_\Omega-P_\omega)(P_\Omega-P_\omega) \\ &=& P_\Omega^2-P_\Omega P_\omega - P_\omega P_\Omega + P_\omega^2 \\ &=& P_\Omega-P_\Omega P_\omega - P_\omega P_\Omega + P_\omega \\ &=& P_\Omega-P_\omega - P_\omega+ P_\omega \ \ \ \Omega \text{ projected against its subspace is its subspace} \\ &=& P_\Omega-P_\omega \\ \end{eqnarray} \]

Because \(V\sim N(0, I)\) and \((P_\Omega-P_\omega)\) is idempotent the Q theorem is applied to conclude that:

\[ \frac{RSS_{H_0}-RSS}{\sigma^2} = \chi^2_r \]

Where: \[ \begin{eqnarray} r&=&Tr(A) \\ &=& Tr(P_\Omega-P_\omega) \\ &=& Tr(X_\Omega(X_\Omega'X_\Omega)^{-1}X_\Omega')-Tr(X_\omega(X_\omega'X_\omega)^{-1}X_\omega') \\ &=& Tr(X_\Omega'X_\Omega(X_\Omega'X_\Omega)^{-1})-Tr(X_\omega'X_\omega(X_\omega'X_\omega)^{-1}) \\ &=& Tr(I_{k+1}) - Tr(I_{g+1}) \\ &=& (k+1)-(g+1) \\ &=& k-g \end{eqnarray} \]

However, \(\frac{RSS_{H_0}-RSS}{\sigma^2}\) contains a population parameter \(\sigma\), so this must be converted into an F statistic.

2.5.2 Conversion

An F statistic is given by the following general formula \(\frac{\chi^2_1/v_1}{\chi^2_2/v_2}\), a Chi square distribution divided by its degrees of freedom divided by another independent Chi square distribution divided by its degrees of freedom. It has been proven that \(\chi^2_{n-(k+1)}=\frac{RSS}{\sigma^2}\), therefore one can divide the difference in \(RSS\) values by this:

\[ F=\frac{\frac{RSS_{H_0}-RSS}{\sigma^2}/(k-g)}{\frac{RSS}{\sigma^2}/[n-(k+1)]} \]

Now it must be asserted that these two Chi Square distributions are independent, i.e., that the covariance between \((P_\Omega-P_\omega)\epsilon\) and \((I-P_\Omega)\epsilon\) is zero. Proof:

\[ F = \frac{\frac{\epsilon'(P_\Omega-P_\omega)\epsilon}{\sigma^2}/(k-g)}{\frac{\epsilon'(I-P_\Omega)\epsilon}{\sigma^2}/[n-(k+1)]} \]

\[ \begin{eqnarray} Cov((P_\Omega-P_\omega)\epsilon, (I-P_\Omega)\epsilon) &=& (P_\Omega-P_\omega)Cov(\epsilon)(I-P_\Omega)' \\ &=& \sigma^2 (P_\Omega-P_\omega)(I-P_\Omega) \\ &=& \sigma^2 (P_\Omega-P_\Omega^2 - P_\omega+P_\omega P_\Omega) \\ &=& \sigma^2 (P_\Omega -P_\Omega - P_\omega+P_\omega) \ \ \ \Omega \text{ projected against its subspace is its subspace} \\ &=& \sigma^2\cdot 0 \\ &=& 0 \end{eqnarray} \]

2.5.3 Brief Summary

The F test statistic is used to compare two nested models in the column space of \(X\).

\[ M_g: y=x\beta+\epsilon=\beta_0+\beta_1x_1+\dots+\beta_gx_g +\epsilon\\ M_k: y=\beta_0+\beta_1x_1+\dots+\beta_gx_g+\beta_{g+1}x_{g+1}+\dots+\beta_kx_k+\epsilon \]

One can calculate the F statistic and compute its upper tail p-value, where \(F\sim f(k-g, n-(k+1))\). If the associated p-value is greater than the chosen \(\alpha\), then there exists insufficient evidence to reject the NULL Hypothesis \(\beta_{g+1}=\dots=\beta_k=0\). However, if the p-value is less than \(\alpha\), then there exists sufficient evidence to reject the NULL and conclude with confidence that at least one \(\beta_i\ne0,\ i>g\).
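A numeric sketch of the full-versus-reduced comparison (simulated data in which the last two coefficients are truly zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, g, k = 60, 2, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 1.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(Xm, y):
    """Residual sum of squares of the least-squares fit."""
    b, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    r = y - Xm @ b
    return r @ r

rss_full = rss(X, y)              # Omega: all k predictors
rss_h0 = rss(X[:, :g + 1], y)     # omega: intercept + first g predictors

F = ((rss_h0 - rss_full) / (k - g)) / (rss_full / (n - (k + 1)))
pval = 1 - stats.f.cdf(F, k - g, n - (k + 1))
```

Since \(\omega\subset\Omega\), the reduced model can never fit better, so the numerator is always nonnegative.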

2.6 Assumptions

The only assumption given during the theoretical derivations is that \(\epsilon\sim N(0, \sigma^2I)\). This can be broken down into four lesser assumptions:

  1. \(\epsilon\) is distributed Normally

  2. \(E(\epsilon)=0\), mean of zero

  3. \(V(\epsilon)=\sigma^2I\), constant variance

  4. All \(\epsilon_i\)’s are independent
    • Although not explicitly stated, this is implied as \(Cov(\epsilon)=\sigma^2I\), where the off-diagonal values of \(\sigma^2 I\) are zero; these are the covariances between each pair \(\epsilon_i, \epsilon_j,\ i\ne j\).

For the mathematical theory used thus far to be correctly applied in action, these assumptions must hold before any conclusions can be made.

2.7 Measuring Model Adequacy

2.7.1 \(R^2\)

Pearson’s Product Moment Coefficient of Correlation, \(r\), is a measure of the strength of a relationship between two variables. The coefficient of determination, \(R^2\), represents the “fraction of the sample variation of the y values that is attributed to the regression model”. The coefficient is defined as the quotient \(MSS/TSS\). In multiple linear regression, the total sum of squares is the sum of the residual sum of squares and the model sum of squares, \(TSS=RSS+MSS\). Thus, if \(MSS\) is large relative to \(TSS\), so that \(RSS\) is small, then the \(R^2\) value will be closer to 1. Thus far, only \(RSS\) and \(TSS\) have been derived, not \(MSS\). However, \(MSS=TSS-RSS\); therefore \(R^2=\frac{MSS}{TSS}=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}\), a computable value.

This gives a way of measuring model adequacy, but the interpretation behind the scalar quantity has not been given. The following table is given as a general “Rule of Thumb for interpreting the Size of a Correlation Coefficient” (see Mukaka 2012).

kable(read.csv("BOOK1.csv", fileEncoding="UTF-8-BOM"))
SIZE OF CORRELATION            INTERPRETATION
.90 to 1.00 (-.90 to -1.00)    Very high positive (negative) correlation
.70 to .90 (-.70 to -.90)      High positive (negative) correlation
.50 to .70 (-.50 to -.70)      Moderate positive (negative) correlation
.30 to .50 (-.30 to -.50)      Low positive (negative) correlation
.00 to .30 (.00 to -.30)       Negligible correlation

2.7.2 \(R_a^2\)

The standard definition of \(R^2\) becomes larger as \(MSS \to TSS\), which occurs when more variables are added. Thus, one can artificially create an influential model by simply adding variables. To account for this, the adjusted R squared value was born, which takes the number of observations and the number of variables into account:

\[ R^2_a=1-\frac{n-1}{n-(k+1)}(1-R^2) \]
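Both quantities are direct to compute once \(RSS\) and \(TSS\) are in hand (illustrative simulation, not the state data):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = ((y - X @ b) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()

r2 = 1 - rss / tss                                  # R^2 = 1 - RSS/TSS
r2_adj = 1 - (n - 1) / (n - (k + 1)) * (1 - r2)     # adjusted R^2
```

Since \(\frac{n-1}{n-(k+1)}>1\), the adjusted value is never larger than the raw \(R^2\).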

2.8 AIC

With datasets containing lots of variables, the sample T test statistics will eventually result in a type one error, as multiple comparisons are being performed. To alleviate this problem, a criterion was developed by Hirotugu Akaike, called the Akaike Information Criterion (AIC). AIC is an estimate of the expected error when predicting a resampled response, also known as in-sample prediction error. Founded on information theory, AIC measures the relative loss of information for a given model when reducing.

By definition, \(AIC=2k-2\ln(\hat L)\) for \(k\) variables and maximum likelihood value \(\hat L\). When comparing models, the associated AIC value is calculated for each, and the model with the minimum value is chosen so as to reduce the loss of information.
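For least-squares models with Gaussian errors, maximizing the likelihood reduces the formula so that, up to an additive constant, \(AIC = n\ln(RSS/n)+2k\). The following Python sketch, with invented RSS values, shows the fit-versus-complexity trade-off the criterion encodes:

```python
import math

# Sketch: for least squares with Gaussian errors, maximizing the likelihood
# gives (up to an additive constant) AIC = n*ln(RSS/n) + 2k.  The RSS values
# below are invented; the point is the fit-versus-complexity trade-off.
def aic_gaussian(rss, n, k):
    return n * math.log(rss / n) + 2 * k

n = 50
aic_full = aic_gaussian(rss=12.0, n=n, k=10)    # 10 predictors, slightly better fit
aic_reduced = aic_gaussian(rss=13.0, n=n, k=6)  # 6 predictors, slightly worse fit
# The reduced model is preferred: its small loss of fit is outweighed
# by the complexity penalty.
```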

This process will be used to reduce a dataset with many variables.

3 Model Selection

# Save the response's mean and standard deviation before standardizing
coviDeathMean = mean(data[,1])
coviDeathSd = sd(data[,1])
data = as.data.frame(scale(data))  # center and scale every variable

3.1 Ordinary Regression

3.1.1 Elementary Model

From the theory, we hypothesize the first elementary model for predicting COVID-19 death rates per state to be a first-order model in all the data variables:

\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_{11}x_{11}+\epsilon \]

In R, the \(lm()\) function assumes the null hypothesis for each \(\beta\)'s T test statistic is \(\beta=0\). The F test statistic reported by the function compares the given model to the default reduced model \(y=\beta_0\), i.e., \(\beta_1=\dots=\beta_k=0\).
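As a sanity check on what \(summary()\) reports, each t value is simply the estimate divided by its standard error under \(H_0:\beta=0\). Taking, for instance, the Heart Disease Rate row of the elementary model's output (a trivial arithmetic check, shown here in Python):

```python
# Each t value reported by summary() is estimate / std.error under H0: beta = 0.
# Values below are the Heart Disease Rate row of the elementary model's output.
estimate, std_error = 7.224e-01, 2.416e-01
t_value = estimate / std_error  # ≈ 2.990, matching the summary table
```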

model = lm(`Covid Death Rate`~., data=data)
summary(model)
## 
## Call:
## lm(formula = `Covid Death Rate` ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90468 -0.32398 -0.02158  0.25137  1.37821 
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error t value
## (Intercept)                               8.062e-17  8.026e-02   0.000
## `Covid Case Rate`                         2.281e-01  9.830e-02   2.321
## `Diabetes Rate`                           7.589e-02  1.832e-01   0.414
## `Hypertension Rate`                       1.927e-01  1.103e-01   1.747
## `Kidney Disease Rate`                    -4.386e-01  2.761e-01  -1.589
## `Chronic Lower Respiratory Disease Rate` -2.552e-01  1.898e-01  -1.345
## `Influenza Rate`                         -1.264e-01  1.163e-01  -1.087
## `Obesity Rate`                           -3.510e-02  1.176e-01  -0.298
## `65 and Older Rate`                      -2.611e-02  1.271e-01  -0.206
## `Heart Disease Rate`                      7.224e-01  2.416e-01   2.990
## `Cancer Rate`                                    NA         NA      NA
## `Population Density`                      6.409e-01  1.083e-01   5.916
##                                          Pr(>|t|)    
## (Intercept)                               1.00000    
## `Covid Case Rate`                         0.02563 *  
## `Diabetes Rate`                           0.68102    
## `Hypertension Rate`                       0.08849 .  
## `Kidney Disease Rate`                     0.12019    
## `Chronic Lower Respiratory Disease Rate`  0.18643    
## `Influenza Rate`                          0.28368    
## `Obesity Rate`                            0.76692    
## `65 and Older Rate`                       0.83825    
## `Heart Disease Rate`                      0.00481 ** 
## `Cancer Rate`                                  NA    
## `Population Density`                     6.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5675 on 39 degrees of freedom
## Multiple R-squared:  0.7437, Adjusted R-squared:  0.6779 
## F-statistic: 11.31 on 10 and 39 DF,  p-value: 9.425e-09

From our F-statistic p-value we can see that at any reasonable \(\alpha\) value we have evidence to reject the null hypothesis and conclude that at least one \(\beta_i\ne0\) for some \(i>0\). The T test statistics show that many of the variables are insignificant and some are not; however, it is important to note that the coefficient estimate for Cancer Rate is NA, indicating that it is a linear combination of the other variables, so it will be dropped for the rest of the analysis. To reduce the chance of a Type I error, we will optimize our model to minimize AIC in the next section.

3.1.2 Reduced Model

Our current elementary model contains many unnecessary variables; to remove them without inflating the Type I error rate, we will eliminate variables so as to minimize the AIC. This procedure is done with the \(step()\) function. Omitting its long output, the final suggested model is the following:

\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5+\beta_6x_6+\epsilon \]

Where \(x_1=\text{Covid Case Rate}\), \(x_2=\text{Hypertension Rate}\), \(x_3=\text{Kidney Disease Rate}\), \(x_4=\text{Chronic Lower Respiratory Disease Rate}\), \(x_5=\text{Heart Disease Rate}\), and \(x_6=\text{Population Density}\). As one can see, we have removed Diabetes, Influenza, Obesity, Cancer, and 65+ rates from our model.
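The backward search that \(step()\) carries out can be sketched as follows. This Python toy uses a hard-coded table of RSS values in place of actual model refits; all variable names and numbers are invented for illustration:

```python
import math

# Sketch of backward elimination by AIC, as performed by step().
# Fits are faked with a hard-coded RSS table (all names and numbers invented);
# a real implementation would refit the model for each candidate subset.
RSS = {
    frozenset({"case", "density", "heart"}): 13.0,
    frozenset({"case", "density"}): 13.5,
    frozenset({"case", "heart"}): 20.0,
    frozenset({"density", "heart"}): 22.0,
}

def aic(variables, n=50):
    # Gaussian-error AIC up to an additive constant
    return n * math.log(RSS[frozenset(variables)] / n) + 2 * len(variables)

def backward_step(variables):
    """Drop the single variable whose removal most lowers AIC; if no
    removal lowers AIC, keep the current model."""
    best_vars, best_aic = set(variables), aic(variables)
    for v in variables:
        candidate = set(variables) - {v}
        if frozenset(candidate) in RSS and aic(candidate) < best_aic:
            best_vars, best_aic = candidate, aic(candidate)
    return best_vars

selected = backward_step({"case", "density", "heart"})  # drops "heart" here
```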

model = lm(formula = `Covid Death Rate` ~ `Covid Case Rate` + `Hypertension Rate` + 
    `Kidney Disease Rate` + `Chronic Lower Respiratory Disease Rate` + 
    `Heart Disease Rate` + `Population Density`, data = data)
summary(model)
## 
## Call:
## lm(formula = `Covid Death Rate` ~ `Covid Case Rate` + `Hypertension Rate` + 
##     `Kidney Disease Rate` + `Chronic Lower Respiratory Disease Rate` + 
##     `Heart Disease Rate` + `Population Density`, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06240 -0.34683 -0.00581  0.28656  1.38677 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                               6.298e-17  7.780e-02   0.000
## `Covid Case Rate`                         2.346e-01  8.802e-02   2.665
## `Hypertension Rate`                       1.815e-01  9.791e-02   1.854
## `Kidney Disease Rate`                    -4.084e-01  2.150e-01  -1.899
## `Chronic Lower Respiratory Disease Rate` -2.223e-01  1.663e-01  -1.336
## `Heart Disease Rate`                      6.139e-01  2.034e-01   3.018
## `Population Density`                      6.482e-01  1.026e-01   6.318
##                                          Pr(>|t|)    
## (Intercept)                               1.00000    
## `Covid Case Rate`                         0.01079 *  
## `Hypertension Rate`                       0.07064 .  
## `Kidney Disease Rate`                     0.06423 .  
## `Chronic Lower Respiratory Disease Rate`  0.18848    
## `Heart Disease Rate`                      0.00427 ** 
## `Population Density`                     1.26e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5502 on 43 degrees of freedom
## Multiple R-squared:  0.7344, Adjusted R-squared:  0.6973 
## F-statistic: 19.82 on 6 and 43 DF,  p-value: 6.155e-11

As one can see from the output, the T test statistic p-values are below the \(\alpha=0.05\) level, except for Hypertension, Kidney Disease, and Chronic Lower Respiratory Disease; for the significant terms, we have evidence to reject the null hypothesis and conclude that it is probable these \(\beta\) coefficients do not equal zero in the presence of the other coefficients.

It is interesting to point out that the \(\beta\) coefficients for both the Kidney Disease and Chronic Lower Respiratory Disease rates are negative, which is counter-intuitive for the model. This occurrence is common among correlated variables.

When analyzing overall model adequacy, the p-value associated with the F statistic is below \(\alpha=0.05\), giving ample evidence to reject the null hypothesis and conclude that at least one \(\beta_i\ne0\) for some \(i>0\). In addition, our adjusted R-squared value is \(0.6973\); therefore, our model explains approximately \(69.73\)% of the variability in COVID-19 death rates even after adjusting for the number of terms in the model.

3.1.3 Multicollinearity

However, we must now test for multicollinearity using variance inflation factors, where values less than five to ten are considered acceptable (James and Tibshirani 2014):

car::vif(model)
##                        `Covid Case Rate` 
##                                 1.254196 
##                      `Hypertension Rate` 
##                                 1.552021 
##                    `Kidney Disease Rate` 
##                                 7.484471 
## `Chronic Lower Respiratory Disease Rate` 
##                                 4.479779 
##                     `Heart Disease Rate` 
##                                 6.698757 
##                     `Population Density` 
##                                 1.703763

As we can see, most of the variables have low variance inflation factors, though Kidney Disease and Heart Disease are suspect. A further dive into the correlation coefficients can reveal more information:

cr = cor(data[,-1])
rownames(cr) = c("Case", "Diab.", "Hyper.", "Kidney", "Chronic.", "Influ.", "Obesity", "65", "Heart", "Cancer", "Pop. Dens.")
colnames(cr) = c("Case", "Diab.", "Hyper.", "Kidney", "Chronic.", "Influ.", "Obesity", "65", "Heart", "Cancer", "Pop. Dens.")
corrplot(cr, type = "upper", order="hclust", tl.col="black", tl.srt=45)

We can see from this visual diagram that many of the variables are strongly positively correlated (Heart Disease, Kidney Disease, Cancer, Diabetes, and Chronic Lower Respiratory Disease); however, only three of our model’s variables appear correlated with one another. Therefore, multicollinearity remains a possibility.

3.1.4 Final Model

As stated before, our model raises moderate concerns about multicollinearity among the independent variables. To reduce this concern, variables with the highest variance inflation factors were dropped until the values became acceptable. After this procedure, the final model was derived:

\[ y = \beta_0+\beta_1x_1+\beta_2x_2+\epsilon \]

where \(x_1=\text{Covid Case Rate}\) and \(x_2=\text{Population Density}\).

model = lm(formula = `Covid Death Rate` ~ `Covid Case Rate` + `Population Density`, data = data)
summary(model)
## 
## Call:
## lm(formula = `Covid Death Rate` ~ `Covid Case Rate` + `Population Density`, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0896 -0.3729 -0.1115  0.2584  2.2367 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.494e-17  8.537e-02   0.000 1.000000    
## `Covid Case Rate`     3.639e-01  8.649e-02   4.207 0.000115 ***
## `Population Density`  7.484e-01  8.649e-02   8.653 2.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6036 on 47 degrees of freedom
## Multiple R-squared:  0.6505, Adjusted R-squared:  0.6356 
## F-statistic: 43.74 on 2 and 47 DF,  p-value: 1.867e-11

From the T test statistic p-values, we can see that all are below our \(\alpha=0.05\), except for the intercept; therefore, we have sufficient evidence to reject the null hypothesis and conclude that it is plausible these \(\beta\) coefficients are not equal to zero in the presence of the other coefficients. For model adequacy, our F statistic p-value is once again below \(\alpha=0.05\); therefore, we have sufficient evidence to reject the null hypothesis and conclude that at least one \(\beta\) coefficient, excluding the intercept, is non-zero. In addition, the adjusted R-squared value suggests that our model explains approximately \(63.56\)% of the variability in COVID-19 death rates when adjusted for the number of terms.

For testing multicollinearity, the variance inflation factor values are equal and below 5:

car::vif(model)
##    `Covid Case Rate` `Population Density` 
##             1.005998             1.005998
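The identical VIF values above are no accident: with exactly two predictors, each VIF equals \(1/(1-r^2)\), where \(r\) is the correlation between the two predictors. A minimal Python sketch with invented data:

```python
# With exactly two predictors, VIF_1 = VIF_2 = 1/(1 - r^2), where r is the
# correlation between them.  The data below are invented for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
var1 = sum((a - m1) ** 2 for a in x1) / (n - 1)
var2 = sum((b - m2) ** 2 for b in x2) / (n - 1)
r = cov / (var1 * var2) ** 0.5  # sample correlation

vif = 1 / (1 - r ** 2)  # the common VIF for both predictors
```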

Now that our model has reduced variance inflation and retains explanatory power, we will use it as our final model describing COVID-19 death rates as a function of Covid Case Rate and Population Density.

3.2 Factor Analysis Regression

In the presence of multicollinearity in the dataset, instead of reducing the model by removing variables to minimize AIC, one can reduce the correlated variables down to a handful of independent factors. Factor analysis seeks to reduce a heavily correlated dataset to a set of underlying latent variables, and is modeled by what is known as the Orthogonal Factor Model with \(m\) common factors:

\[ X=\mu+LF+\epsilon \]

Such that \(E(F)=0\), \(Cov(F)=I\), \(E(\epsilon)=0\), and \(Cov(\epsilon)=\Psi\). It can be shown that the covariance matrix, \(\Sigma\), of \(X\) is equal to \(LL'+\Psi\).

The matrix \(L\) is known as the matrix of factor loadings, which define the reduced latent variables. There are different ways of estimating these loadings; Principal Component Analysis will be used in this analysis.

3.2.1 Theory

3.2.1.1 Principal Components

As stated before, \(L\) can be estimated by Principal Component Analysis. Principal components explain the variance-covariance or correlation matrix of a dataset through a set of uncorrelated linear combinations of the variables. Let \(Y\) denote the principal components, where \(a'X\) is some linear combination of the data:

\[ \begin{matrix} Y_1 =a'_1X=a_{11}X_1+a_{12}X_2+\dots+a_{1p}X_p \\ Y_2 =a'_2X=a_{21}X_1+a_{22}X_2+\dots+a_{2p}X_p \\ \vdots \\ Y_p =a'_pX=a_{p1}X_1+a_{p2}X_2+\dots+a_{pp}X_p \end{matrix} \]

such that \(Var(Y_i)=a_i'\Sigma a_i\ \forall i=1,2,\dots,p\) and \(Cov(Y_i,Y_k)=a_i'\Sigma a_k\ \forall i,k=1,2,\dots,p\).

It can be shown that these coefficient vectors are the eigenvectors of the covariance or correlation matrix, giving the decomposition \(Y_i=e_i'X\). This follows from the Spectral Decomposition theorem, whereby a symmetric matrix can be decomposed into its eigenvalue-eigenvector pairs. The proportion of the total population variance explained by a principal component is its corresponding eigenvalue divided by the sum of all the eigenvalues. Thus, components with larger eigenvalues are more influential to the model than components with smaller ones; this is how we can reduce our dataset to a few underlying principal components.
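A minimal numeric sketch of this idea, for a \(2\times2\) correlation matrix whose eigenstructure is known in closed form (Python, with an invented correlation value):

```python
# For the 2x2 correlation matrix [[1, r], [r, 1]], the eigenvalues are
# 1 + r and 1 - r, and the proportion of total variance explained by a
# component is its eigenvalue over the trace (= p = 2 for correlations).
r = 0.8  # invented correlation between two variables
eigenvalues = [1 + r, 1 - r]
total = sum(eigenvalues)  # trace of a 2x2 correlation matrix

proportions = [lam / total for lam in eigenvalues]
# The first component alone explains (1 + r)/2 = 90% of the variance here,
# mirroring how one component captured ~80% for the six health variables.
```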

3.2.1.2 Factor Loadings

As stated before, \(\Sigma=LL'+\Psi\), where \(L\) will be estimated by the principal components. If \(L\) is set to include all the principal components, then \(\Psi=0\). However, if the principal components are reduced down to a few heavily influential components, then \(L=[\sqrt{\lambda_1}e_1|\sqrt{\lambda_2}e_2|\dots |\sqrt{\lambda_m}e_m]\) and \(\Psi=\mathrm{diag}(\psi_1,\dots,\psi_p)\), where \(m\) is the chosen number of factors and \(\psi_i=\sigma_{ii}-\sum_jl^2_{ij} \ \forall i=1,2,\dots,p\). The \(\Psi\) matrix contains the specific variances: how much of each variable's variability is not explained by the factor loadings. Thus, models with lower \(\psi_i\) values explain more of the variability than those with higher \(\psi_i\) values.
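A minimal sketch of a one-factor principal-component solution, using the \(2\times2\) correlation matrix \([[1,r],[r,1]]\), whose first eigenvalue is \(1+r\) with eigenvector \((1/\sqrt2,1/\sqrt2)\) (Python, invented \(r\)):

```python
import math

# One-factor principal-component solution for [[1, r], [r, 1]]:
# L = sqrt(lambda_1) * e_1 and psi_i = sigma_ii - l_i^2.
r = 0.8
lam1 = 1 + r
e1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # first eigenvector

L = [math.sqrt(lam1) * e for e in e1]  # factor loadings
psi = [1 - l ** 2 for l in L]          # specific variances (diagonal of Psi)

# LL' + Psi reproduces the diagonal of Sigma exactly, while the off-diagonal
# (1 + r)/2 = 0.9 only approximates the true r = 0.8.
diag = L[0] ** 2 + psi[0]
offdiag = L[0] * L[1]
```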

Factor loadings are interpreted as the correlation of the factor with the respective \(i\)th variable. To aid in this interpretation, a factor loading rotation may be used (varimax in our analysis), which applies a transformation matrix and chooses the angle maximizing the varimax criterion.

3.2.1.3 Factor Scores

The factor loadings reduce heavily correlated variables down to a set of independent linear combinations of the variables; in order to perform a multiple linear regression, the factor scores \(F\) must be obtained: the projection of the original dataset onto the factor loadings. Just as with \(\beta\), least squares can be used to obtain \(F\):

\[ \begin{eqnarray} X-\mu&=&L\hat F+\hat \epsilon \\ X-\mu-L\hat F-\hat \epsilon&=&0 \\ L'(X-\mu-L\hat F-\hat \epsilon)&=&0 \\ L'(X-\mu)-L'L\hat F-L'\hat \epsilon&=&0 \\ L'L\hat F&=&L'(X-\mu) \ \ \ \ L' \text{ and }\hat \epsilon \text{ are orthogonal} \\ (L'L)^{-1}L'L\hat F&=&(L'L)^{-1}L'(X-\mu) \\ I\hat F&=&(L'L)^{-1}L'(X-\mu) \\ \hat F&=&(L'L)^{-1}L'(X-\mu) \\ \end{eqnarray} \]

The factor scores form an \(n\times m\) matrix, where \(m\) is the chosen number of factor loadings. One can now use these factor scores as the new data for the heavily correlated variables in multiple linear regression.
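The score formula can be sketched for a single factor (\(m=1\)), where \((L'L)^{-1}\) reduces to a scalar. An illustrative Python sketch with invented loadings and one invented centered observation:

```python
# Least-squares factor scores F = (L'L)^{-1} L'(X - mu) for a single factor
# (m = 1), where the inverse reduces to a scalar.  Loadings and the centered
# observation below are invented for illustration.
L = [0.90, 0.95, 0.85]         # one-factor loadings for three variables
x_centered = [1.2, 0.8, 1.0]   # one observation, already centered

LtL = sum(l * l for l in L)                              # scalar L'L
score = sum(l * x for l, x in zip(L, x_centered)) / LtL  # this observation's score
# Stacking one such score per observation gives the n x m score matrix used
# as the regression data for the latent variable.
```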

3.2.2 Principal Component Analysis

When describing the complete dataset, principal component analysis was used to show the variability explained by the components. Here, however, principal component analysis will only be used on the heavily correlated variables discussed earlier: the Diabetes, Hypertension, Kidney Disease, Chronic Lower Respiratory Disease, Heart Disease, and Cancer morbidity rates. We will reduce these correlated variables through factor analysis. First, the correlation matrix of the variables was decomposed into its principal components (eigenvectors), and a scree plot was created to display each component by the scale of its eigenvalue:

dataFA = read_excel("COVID3.xls")

dataFA = dataFA[,-1]    # drop the state identifier column
DeathRate = dataFA[,1]  # save the response (Covid Death Rate)
dataFA = dataFA[,-1]    # drop the response from the predictors

dataFA = dataFA[,c(2,3,4,5,9,10)]  # keep the six correlated morbidity rates
pc = prcomp(dataFA, scale=TRUE)    # principal components of the scaled data
fviz_eig(pc)                       # scree plot

As one can see from this scree plot, the first principal component explains almost 80% of the variability in the dataset, while components 2 through 4 each explain around 10% or less. Therefore, we can reduce our six-variable dataset down to one principal component.

3.2.3 Factor Analysis

Now that we have performed a short principal component analysis, only one principal component will be used in forming the factor loadings.

fa <-  fa(r=dataFA, nfactors = 1, rotate="varimax",fm="pa")
## In factor.scores, the correlation matrix is singular, an approximation is used
print(fa)
## Factor Analysis using method =  pa
## Call: fa(r = dataFA, nfactors = 1, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                         PA1   h2    u2 com
## Diabetes Rate                          0.86 0.73 0.267   1
## Hypertension Rate                      0.56 0.31 0.689   1
## Kidney Disease Rate                    0.97 0.95 0.052   1
## Chronic Lower Respiratory Disease Rate 0.84 0.70 0.300   1
## Heart Disease Rate                     0.90 0.81 0.188   1
## Cancer Rate                            0.97 0.95 0.052   1
## 
##                 PA1
## SS loadings    4.45
## Proportion Var 0.74
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 factor is sufficient.
## 
## The degrees of freedom for the null model are  15  and the objective function was  26.77 with Chi Square of  1235.68
## The degrees of freedom for the model are 9  and the objective function was  19.56 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.06 
## 
## The harmonic number of observations is  50 with the empirical chi square  3.53  with prob <  0.94 
## The total number of observations was  50  with Likelihood Chi Square =  890.07  with prob <  8.5e-186 
## 
## Tucker Lewis Index of factoring reliability =  -0.221
## RMSEA index =  1.399  and the 90 % confidence intervals are  1.336 1.493
## BIC =  854.86
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1
## Correlation of (regression) scores with factors   0.99
## Multiple R square of scores with factors          0.97
## Minimum correlation of possible factor scores     0.95

From the output, the RMSR value, the root mean square of the off-diagonal residuals, is 0.05, where values of about 0.05 or less indicate an adequate model; the df-corrected RMSR is similar at 0.06. However, our Tucker Lewis Index of factoring reliability is low, \(-0.221\), as values greater than 0.9 are considered acceptable. Increasing to two factors raised this index to 0.96, as more variability in the dataset was explained by the second factor. However, the second factor correlated heavily only with Obesity Rate, so it was dropped in favor of supplementing the regression model with the original Obesity Rate variable instead.

In regards to our factor analysis, we can see that the first and only factor, \(PA1\), is heavily correlated with the Diabetes, Kidney Disease, Chronic Lower Respiratory Disease, Heart Disease, and Cancer rates; thus it may be called an ‘Underlying Health Conditions’ latent variable. The specific variances \(\Psi\), denoted by \(u2\), are extremely low except for the Hypertension rate (68.9% of whose variability is not explained), with mention of the Diabetes and Chronic Lower Respiratory Disease rates.

Lastly, the multiple R-squared for deriving the factor scores is given as 0.97; therefore, the regression used to derive the factor scores explains approximately 97% of the variability in the centered variables.

fa.diagram(fa)

Here we can see a pictorial representation of the factor loadings against their corresponding variables, where each loading entry is the correlation value for the respective variable.

3.2.4 Interpreting Underlying Health Conditions

The interpretation behind the Factor Loading used in our regressional analysis must begin by looking at the factor loading entry values:

fa$loading
## 
## Loadings:
##                                        PA1  
## Diabetes Rate                          0.856
## Hypertension Rate                      0.557
## Kidney Disease Rate                    0.974
## Chronic Lower Respiratory Disease Rate 0.837
## Heart Disease Rate                     0.901
## Cancer Rate                            0.974
## 
##                  PA1
## SS loadings    4.451
## Proportion Var 0.742

Each entry of our factor loading corresponds to how correlated the factor is with the respective variable; thus all these variables, except Hypertension, are heavily correlated with the factor. Consequently, any increase in our factor will correspond to an increase in these values. To see how much of an increase in the factor corresponds to an increase in each variable, we need to look at the regression behind deriving \(F\).

fa$weights
##                                               PA1
## Diabetes Rate                          0.17491006
## Hypertension Rate                      0.02108163
## Kidney Disease Rate                    0.33223086
## Chronic Lower Respiratory Disease Rate 0.08812859
## Heart Disease Rate                     0.10055472
## Cancer Rate                            0.33222736

Here we can see the \(\beta\)-like coefficients for \(F\), derived inside the factor analysis, where \(X-\mu=LF\). Thus, our centered data \(X-\mu\) is a linear combination of our factor loadings \(L\) against these \(\beta\)-like coefficients for \(F\). Any unit increase in Underlying Health Conditions will increase the Kidney Disease Rate by \(0.3322\), the Cancer Rate by \(0.3322\), and so on. As we can see, an increase in our factor contributes most heavily to higher Kidney Disease and Cancer rates, with mention of Diabetes.

3.2.5 Model Adequacy

The Factor scores from our factor analysis were obtained and combined with the variables that were not represented well, Covid Case Rate and Population Density.

dataFA = cbind(DeathRate, fa$scores, data$`Covid Case Rate`, data$`Population Density`)
colnames(dataFA ) = c("Covid Death Rate", "Underlying Health Conditions", "Covid Case Rate", "Population Density")
modelFa = lm(`Covid Death Rate` ~ ., data=dataFA )
summary(modelFa)
## 
## Call:
## lm(formula = `Covid Death Rate` ~ ., data = dataFA)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.414e-04 -1.549e-04 -5.043e-05  1.177e-04  9.028e-04 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.923e-04  3.448e-05  17.180  < 2e-16 ***
## `Underlying Health Conditions` 2.337e-05  3.557e-05   0.657 0.514398    
## `Covid Case Rate`              1.433e-04  3.519e-05   4.072 0.000182 ***
## `Population Density`           2.999e-04  3.494e-05   8.584 4.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002438 on 46 degrees of freedom
## Multiple R-squared:  0.6537, Adjusted R-squared:  0.6312 
## F-statistic: 28.95 on 3 and 46 DF,  p-value: 1.145e-10

As we can see from the F test statistic p-value, no matter what \(\alpha\) value we choose, we have sufficient evidence to reject the null hypothesis and conclude with confidence that at least one \(\beta_i\ne0\) for some \(i>0\). It was shown in the ordinary regression that the T test statistics for Covid Case Rate and Population Density were significant, as is reproduced here. However, it is extremely interesting to note that our Underlying Health Conditions factor is practically useless: its associated p-value is greater than \(\alpha=0.05\); therefore we have insufficient evidence to reject the null hypothesis that \(\beta_1=0\).

As stated before, the Underlying Health Conditions factor is heavily correlated with the Diabetes, Kidney Disease, Chronic Lower Respiratory Disease, Heart Disease, and Cancer rates. Thus, these strongly correlated variables can be reduced down to one latent variable, which we have called Underlying Health Conditions. It is interesting to note that this factor shows no influence in our regression model; therefore, given the evidence, the underlying latent variable of these health conditions is not influential in our model for COVID-19 death rates per state. Consequently, this factor analysis regression model will be dropped for the rest of the analysis, as it is otherwise equivalent to our ordinary regression model.

4 Checking Assumptions 1

As stated before, our theory for multiple linear regression is built off the following four assumptions:

  • Normality of the Errors \(\epsilon\sim N(0, \sigma^2)\)

  • Mean of Errors is zero \(E(\epsilon)=0\)

  • Constant Variance for the Errors \(V(\epsilon_i)=\sigma^2\)

  • Independence of the Data

4.0.1 Normality

normcheck(model, shapiro.wilk = TRUE)

In Shapiro and Wilk’s original work, they wrote that their statistical method for detecting normality was derived by “dividing the square of an appropriate linear combination of the sample order statistics by the usual symmetric estimate of variance.” This ratio was deemed “invariant,” hence it was declared “appropriate for a test of the composite hypothesis of normality” (see WILK 1965). From the Shapiro-Wilk test above, the p-value is less than the chosen \(\alpha\) value of 0.05; therefore, there is sufficient evidence to reject the null hypothesis and conclude with confidence that the residuals are not normally distributed. Thus, a major assumption is violated.

Mendenhall and Sincich write that economic and scientific data with non-normal residuals can often be corrected by applying a stabilizing transformation, namely the natural logarithm (Mendenhall and Sincich 2016). Thus, we will apply a natural logarithm transformation to the response variable.
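A tiny illustration of why the log transform helps: it compresses large values far more than small ones, pulling in a long right tail. The values in this Python sketch are invented:

```python
import math

# A log transform compresses large values far more than small ones,
# taming right-skewed data.  Values below are invented for illustration.
raw = [1.0, 2.0, 3.0, 50.0]          # long right tail
logged = [math.log(v) for v in raw]  # tail pulled in toward the bulk

raw_spread = max(raw) - min(raw)           # 49.0
logged_spread = max(logged) - min(logged)  # about 3.9
```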

5 Updating the Models

Now we will apply a natural logarithm to the response variable and refit the models:

5.1 Ordinary Regression

5.1.1 Elementary Model

data = data[,-11]  # drop Cancer Rate, a linear combination of the other variables
model = lm(log(`Covid Death Rate`)~ ., data=data)
summary(model)
## 
## Call:
## lm(formula = log(`Covid Death Rate`) ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25803 -0.43916  0.07437  0.46882  1.01564 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                              -0.83808    0.40911  -2.049
## `Covid Case Rate`                        -0.08315    0.28072  -0.296
## `Diabetes Rate`                           1.67063    0.72414   2.307
## `Hypertension Rate`                      -0.04360    0.54059  -0.081
## `Kidney Disease Rate`                    -0.80591    1.48543  -0.543
## `Chronic Lower Respiratory Disease Rate` -0.85270    0.62433  -1.366
## `Influenza Rate`                          0.69046    0.59599   1.159
## `Obesity Rate`                           -0.27646    0.45694  -0.605
## `65 and Older Rate`                       0.11628    0.47027   0.247
## `Heart Disease Rate`                      0.05542    0.60626   0.091
## `Population Density`                      0.59314    0.27603   2.149
##                                          Pr(>|t|)  
## (Intercept)                                0.0708 .
## `Covid Case Rate`                          0.7738  
## `Diabetes Rate`                            0.0465 *
## `Hypertension Rate`                        0.9375  
## `Kidney Disease Rate`                      0.6006  
## `Chronic Lower Respiratory Disease Rate`   0.2052  
## `Influenza Rate`                           0.2765  
## `Obesity Rate`                             0.5601  
## `65 and Older Rate`                        0.8103  
## `Heart Disease Rate`                       0.9292  
## `Population Density`                       0.0601 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8199 on 9 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.7683, Adjusted R-squared:  0.5109 
## F-statistic: 2.985 on 10 and 9 DF,  p-value: 0.05747

From our F-statistic p-value we can see that at \(\alpha=0.05\) our model is on the verge of being influential. We can see from the T test statistics that all of the variables are insignificant except Diabetes Rate. Even though few of the variables seem influential, our model has an adjusted R-squared value of \(0.5109\). We will now reduce our model using AIC.

5.1.2 Reduced Model

After reducing our logarithm-transformed model with the \(step()\) function, we get the following reduced model:

\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\epsilon \]

Where \(x_1=\text{Diabetes Rate}\), \(x_2=\text{Chronic Lower Respiratory Disease}\) and \(x_3=\text{Population Density}\).

model = lm(formula = log(`Covid Death Rate`) ~ `Diabetes Rate`  + `Chronic Lower Respiratory Disease Rate` + `Population Density`, 
    data = data)
summary(model)
## 
## Call:
## lm(formula = log(`Covid Death Rate`) ~ `Diabetes Rate` + `Chronic Lower Respiratory Disease Rate` + 
##     `Population Density`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1015 -0.5281 -0.1493  0.5667  1.2422 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                               -1.0437     0.1992  -5.238
## `Diabetes Rate`                            1.3742     0.4059   3.386
## `Chronic Lower Respiratory Disease Rate`  -1.0185     0.3427  -2.972
## `Population Density`                       0.6168     0.1711   3.605
##                                          Pr(>|t|)    
## (Intercept)                              8.12e-05 ***
## `Diabetes Rate`                           0.00377 ** 
## `Chronic Lower Respiratory Disease Rate`  0.00900 ** 
## `Population Density`                      0.00237 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.777 on 16 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.6301, Adjusted R-squared:  0.5607 
## F-statistic: 9.083 on 3 and 16 DF,  p-value: 0.0009592

As one can see from our F test statistic p-value, with \(\alpha=0.05\) we have enough evidence to reject the null hypothesis and conclude that at least one \(\beta_i\ne0\) for some \(i>0\). All of our T test statistic p-values are less than our \(\alpha\) value; therefore we have enough evidence to reject the null hypothesis and conclude with confidence that these \(\beta\) values are not equal to zero, and thus influential. Our adjusted R-squared value is now \(0.5607\), which implies that our model explains approximately \(56.07\)% of the variability in COVID-19 death rates. It is interesting to note that removing variables through the AIC procedure made our model statistically influential compared to the elementary model.

car::vif(model)
##                          `Diabetes Rate` 
##                                 2.304100 
## `Chronic Lower Respiratory Disease Rate` 
##                                 2.082863 
##                     `Population Density` 
##                                 1.587847

As we can see from the Variance Inflation Factors, all values are well below the common rule-of-thumb threshold of 10, so multicollinearity among our predictors is no longer a concern.
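For reference, the Variance Inflation Factor for the \(j\)-th predictor is

\[ \text{VIF}_j=\frac{1}{1-R_j^2} \]

where \(R_j^2\) is the coefficient of determination from regressing the \(j\)-th predictor on the remaining predictors; for example, the Diabetes Rate VIF of \(2.30\) corresponds to \(R_j^2\approx0.57\).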

5.2 Factor Analysis Regression

Before performing a logarithm transformation, it was shown that our correlated variables reduced down to a single factor loading, which we called Underlying Health Conditions, and that this factor was non-influential to our model. However, now that we have transformed our dataset, let us re-evaluate our factor analysis model, this time with second order terms, since it was revealed through trial and error that these helped produce a constant trend in the mean of our residuals:

\[ y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+\beta_4x_2^2+\beta_5x_3+\beta_6x_3^2 \]

Where \(x_1=\text{Underlying Health Conditions}\), \(x_2=\text{Covid Case Rate}\) and \(x_3=\text{Population Density}\).

dataFA = cbind(DeathRate, fa$scores, data$`Covid Case Rate`, data$`Population Density`)
colnames(dataFA ) = c("Covid Death Rate", "Underlying Health Conditions", "Covid Case Rate", "Population Density")
modelFa = lm(log(`Covid Death Rate`) ~`Underlying Health Conditions`+I(`Underlying Health Conditions`^2)+`Covid Case Rate`+I(`Covid Case Rate`^2)+`Population Density`+I(`Population Density`^2), data=dataFA )
summary(modelFa)
## 
## Call:
## lm(formula = log(`Covid Death Rate`) ~ `Underlying Health Conditions` + 
##     I(`Underlying Health Conditions`^2) + `Covid Case Rate` + 
##     I(`Covid Case Rate`^2) + `Population Density` + I(`Population Density`^2), 
##     data = dataFA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73851 -0.30371 -0.04673  0.29160  0.69236 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         -7.42492    0.07967 -93.190  < 2e-16
## `Underlying Health Conditions`       0.06975    0.06802   1.026   0.3109
## I(`Underlying Health Conditions`^2) -0.06993    0.03068  -2.279   0.0277
## `Covid Case Rate`                    0.38910    0.06351   6.126 2.39e-07
## I(`Covid Case Rate`^2)              -0.08840    0.04096  -2.158   0.0365
## `Population Density`                 0.61138    0.12729   4.803 1.92e-05
## I(`Population Density`^2)           -0.09964    0.04794  -2.078   0.0437
##                                        
## (Intercept)                         ***
## `Underlying Health Conditions`         
## I(`Underlying Health Conditions`^2) *  
## `Covid Case Rate`                   ***
## I(`Covid Case Rate`^2)              *  
## `Population Density`                ***
## I(`Population Density`^2)           *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.397 on 43 degrees of freedom
## Multiple R-squared:  0.7728, Adjusted R-squared:  0.7411 
## F-statistic: 24.38 on 6 and 43 DF,  p-value: 2.352e-12

Here we can see from the F test statistic p-value that, for any reasonable \(\alpha\) value, we have enough evidence to reject the null hypothesis and conclude with confidence that at least one of these \(\beta\) values does not equal zero. We can also see that our T test statistic p-values are significant; therefore, we have enough evidence to conclude that these variables are influential. As for adequacy, our model explains approximately \(74.11\)% of the variability in COVID-19 death rates. After transforming the response variable, our underlying latent variable, Underlying Health Conditions, is influential to our model.
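As a check on the printed output, the F statistic can be recovered from the multiple R squared, with \(n=50\) states and \(k=6\) model terms (43 residual degrees of freedom):

\[ F=\frac{R^2/k}{(1-R^2)/(n-k-1)}=\frac{0.7728/6}{0.2272/43}\approx24.4 \]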

5.3 Comparing the Models

Our first Ordinary Regression model is the following:

\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3 \]

Where \(x_1=\text{Diabetes Rate}\), \(x_2=\text{Chronic Lower Respiratory Disease Rate}\) and \(x_3=\text{Population Density}\); this model has an adjusted R squared value of \(0.5607\).

Our Factor Analysis model is the following:

\[ y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+\beta_4x_2^2+\beta_5x_3+\beta_6x_3^2 \]

Where \(x_1=\text{Underlying Health Conditions}\), \(x_2=\text{Covid Case Rate}\) and \(x_3=\text{Population Density}\); this model has an adjusted R squared value of \(0.7411\).

When comparing the two models, the Factor Analysis model has a higher adjusted R squared value than the Ordinary Regression model; therefore, the Factor Analysis model will be used for the rest of the analysis.

6 Checking Assumptions 2

As stated before, our theory for multiple linear regression is built off the following four assumptions:

  • Normality of the Data \(\epsilon\sim N(0, \sigma^2)\)

  • Mean of Errors is zero \(E(\epsilon)=0\)

  • Constant Variance for the Errors \(V(\epsilon_i)=\sigma^2\)

  • Independence of the Data

6.1 Factor Analysis Regression

6.1.1 Normality

normcheck(modelFa, shapiro.wilk = TRUE)

From the Shapiro-Wilk test above, the p-value is greater than the chosen \(\alpha\) value of 0.05; therefore, there is insufficient evidence to reject the null hypothesis of normality, and we may proceed under the assumption that the residuals are normally distributed.

6.1.2 Mean of Errors

We have assumed the errors have a mean of zero, \(E(\epsilon)=0\). To check this assumption, we plot the residuals versus each variable and check for any trends in the data.

Here we have our residuals versus Underlying Health Conditions:

fit=predict.lm(modelFa, newdata=dataFA)
resid = resid(modelFa)
df = as.data.frame(cbind(fit, resid))
df2 = as.data.frame(cbind(resid, dataFA$`Underlying Health Conditions`))
df3 = as.data.frame(cbind(resid, dataFA$`Covid Case Rate`))
df4 = as.data.frame(cbind(resid, dataFA$`Population Density`))
ggplot(data=df2, aes(x=V2, y=resid))+geom_point()+xlab("X1-Underlying Health Conditions")+ylab("Residuals")+ggtitle("Residuals vs. X1-Underlying Health Conditions")+geom_hline(yintercept=0, linetype="dashed")+geom_smooth(method = "loess")

We can see that our residuals versus Underlying Health Conditions are situated approximately about the mean of zero; therefore, this assumption holds for this variable.

Here we have our residuals versus Covid Case Rate:

ggplot(data=df3, aes(x=V2, y=resid))+geom_point()+xlab("X2-Covid Case Rate")+ylab("Residuals")+ggtitle("Residuals vs. X2-Covid Case Rate")+geom_hline(yintercept=0, linetype="dashed")+geom_smooth(method = "loess")

Even though there seems to be a slight parabolic dip between the -2 and -1 standard deviations, the spread of the data in this interval is consistent with the rest of the data.

Here we have the residuals plotted against our third variable, Population Density:

ggplot(data=df4, aes(x=V2, y=resid))+geom_point()+xlab("X3-Population Density")+ylab("Residuals")+ggtitle("Residuals vs. X3-Population Density")+geom_hline(yintercept=0, linetype="dashed")+geom_smooth(method = "loess")

As one can see from the graph, our residuals are situated about the horizontal mean line of 0. Even towards the second and third standard deviations, the observations still show a constant spread when compared to the majority.

6.1.3 Constant Variance for Errors

This assumption asserts that the errors in our model all have a constant variance, \(\sigma^2\). Here we must check for homoscedastic errors amongst the residuals.

ggplot(data=df, aes(x=fit, y=resid))+geom_point()+xlab("Fitted Values (y estimates)")+ylab("Residuals")+ggtitle("Residuals vs. Fitted Values")+geom_hline(yintercept=0, linetype="dashed")

As we can see from the residuals versus fitted values plot, the variance appears roughly constant across the fitted values. Even though our model does not show perfectly homoscedastic errors, the departure is not severe enough to raise large concern.

6.1.4 Independence of the Data

For each observation (state), the data for the variables were collected independently. Counting the number of morbidity cases for COVID-19, Heart Disease, Chronic Lower Respiratory Disease, etc. for one state did not affect the counts for another. The same logic applies to the other variables. Each variable was recorded at a different time, with every state's observation taken simultaneously at the recording of the respective variable. Therefore, the data observations are independent.

7 Analysis

7.1 Factor Analysis Regression

7.1.1 Regressional Plane

Even though we have three variables in our model, we can create a three-dimensional regression plane by holding Underlying Health Conditions constant within our model; this lets us plot a regression surface of Population Density and Covid Case Rate projected against COVID-19 death rates. Each graph contains not only the regression surface but a 3D scatterplot of the original data (Note: this is per state):
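Holding \(x_1\) fixed at a constant \(c\) and undoing the logarithm transformation of the response, the fitted surface being plotted is

\[ \hat{y}=\exp\left(\hat\beta_0+\hat\beta_1c+\hat\beta_2c^2+\hat\beta_3x_2+\hat\beta_4x_2^2+\hat\beta_5x_3+\hat\beta_6x_3^2\right) \]

where \(c\) is set to a chosen quantile of Underlying Health Conditions.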

popMean = mean(orig$`Population Density`)
popSd = sd(orig$`Population Density`)
caseMean = mean(orig$`Covid Case Rate`)
caseSd = sd(orig$`Covid Case Rate`)
myPred = function(model, underlyingConstant, x) {
  pred = matrix(0, ncol=1, nrow=nrow(x))
  for(i in 1:nrow(x)) {
    a = cbind(underlyingConstant, x[i,])
    colnames(a) = c("Underlying Health Conditions", "Covid Case Rate", "Population Density")
    pred[i,1]=predict(model, a)
  }
  exp(pred)
}
ind = as.data.frame((matrix(c(0, 1, 2, 3), ncol=4)*popSd)+popMean)
ind2 = as.data.frame((matrix(c(-2, -1, 0, 1, 2, 3), nrow=1)*caseSd)+caseMean)
colnames(ind) = c("0", "1", "2", "3")
rownames(ind) = c("Pop. D.")
colnames(ind2) = c("-2", "-1", "0", "1", "2", "3")
rownames(ind2) = c("C. R.")

7.1.1.1 20% Quantile

Here we plot Covid Case Rate and Population Density with a constant Underlying Health Condition equal to its \(20\)% quantile against Covid Death Rate. Note that Covid Death Rate is in its original metrics but Covid Case Rate and Population Density are in their z-scores.

xVal = dataFA$`Covid Case Rate`
yVal = dataFA$`Population Density`
xPred = seq(min(xVal), max(xVal))
yPred = seq(min(yVal), max(yVal))

zPred = expand.grid(`Covid Case Rate` = xPred, `Population Density`=yPred, KEEP.OUT.ATTRS = FALSE)
zPred$`Covid Death Rate` = (myPred(modelFa, quantile(dataFA$`Underlying Health Conditions`, 0.2), zPred))
zPred = acast(zPred, `Covid Case Rate`~`Population Density`, value.var="Covid Death Rate")

myPlot = plot_ly(dataFA, x=~`Covid Case Rate`, y=~`Population Density`, z=~`Covid Death Rate`, type="scatter3d", mode="markers")
myPlot = add_trace(p=myPlot, z = zPred, x=xPred, y=yPred, type = "surface")
myPlot

For reference in the plot, here are the z-score values (column names) and their corresponding original variable values:

Population Density (People per square mile):

ind
##                0        1        2        3
## Pop. D. 203.8526 468.4812 733.1099 997.7386

Case Rate (Number of cases per person)

ind2
##                -2         -1          0          1          2          3
## C. R. 0.003618468 0.01562661 0.02763473 0.03964285 0.05165098 0.06365911

7.1.1.2 50% Quantile

Here we plot Covid Case Rate and Population Density with a constant Underlying Health Condition equal to its \(50\)% quantile against Covid Death Rate. Note that Covid Death Rate is in its original metrics but Covid Case Rate and Population Density are in their z-scores.

xVal = dataFA$`Covid Case Rate`
yVal = dataFA$`Population Density`
xPred = seq(min(xVal), max(xVal))
yPred = seq(min(yVal), max(yVal))

zPred = expand.grid(`Covid Case Rate` = xPred, `Population Density`=yPred, KEEP.OUT.ATTRS = FALSE)
zPred$`Covid Death Rate` = (myPred(modelFa, quantile(dataFA$`Underlying Health Conditions`, 0.5), zPred))
zPred = acast(zPred, `Covid Case Rate`~`Population Density`, value.var="Covid Death Rate")
myPlot = plot_ly(dataFA, x=~`Covid Case Rate`, y=~`Population Density`, z=~`Covid Death Rate`, type="scatter3d", mode="markers")
myPlot = add_trace(p=myPlot, z = zPred, x=xPred, y=yPred, type = "surface")
myPlot

For reference in the plot, here are the z-score values (column names) and their corresponding original variable values:

Population Density (People per square mile):

ind
##                0        1        2        3
## Pop. D. 203.8526 468.4812 733.1099 997.7386

Case Rate (Number of cases per person)

ind2
##                -2         -1          0          1          2          3
## C. R. 0.003618468 0.01562661 0.02763473 0.03964285 0.05165098 0.06365911

7.1.1.3 80% Quantile

Here we plot Covid Case Rate and Population Density with a constant Underlying Health Condition equal to its \(80\)% quantile against Covid Death Rate. Note that Covid Death Rate is in its original metrics but Covid Case Rate and Population Density are in their z-scores.

xVal = dataFA$`Covid Case Rate`
yVal = dataFA$`Population Density`
xPred = seq(min(xVal), max(xVal))
yPred = seq(min(yVal), max(yVal))

zPred = expand.grid(`Covid Case Rate` = xPred, `Population Density`=yPred, KEEP.OUT.ATTRS = FALSE)
zPred$`Covid Death Rate` = (myPred(modelFa, quantile(dataFA$`Underlying Health Conditions`, 0.8), zPred))
zPred = acast(zPred, `Covid Case Rate`~`Population Density`, value.var="Covid Death Rate")

myPlot = plot_ly(dataFA, x=~`Covid Case Rate`, y=~`Population Density`, z=~`Covid Death Rate`, type="scatter3d", mode="markers")
myPlot = add_trace(p=myPlot, z = zPred, x=xPred, y=yPred, type = "surface")
myPlot

For reference in the plot, here are the z-score values (column names) and their corresponding original variable values:

Population Density (People per square mile):

ind
##                0        1        2        3
## Pop. D. 203.8526 468.4812 733.1099 997.7386

Case Rate (Number of cases per person)

ind2
##                -2         -1          0          1          2          3
## C. R. 0.003618468 0.01562661 0.02763473 0.03964285 0.05165098 0.06365911

7.1.2 Confidence Intervals

Here we will construct \(95\)% confidence intervals for each of the \(\beta\) coefficients.

confint(modelFa)
##                                           2.5 %       97.5 %
## (Intercept)                         -7.58560075 -7.264241203
## `Underlying Health Conditions`      -0.06741585  0.206921155
## I(`Underlying Health Conditions`^2) -0.13180027 -0.008057839
## `Covid Case Rate`                    0.26101544  0.517177222
## I(`Covid Case Rate`^2)              -0.17101328 -0.005794841
## `Population Density`                 0.35466854  0.868085641
## I(`Population Density`^2)           -0.19632229 -0.002947855

We can see from the 95% confidence interval that the underlying mean value of the intercept is negative. For Covid Case Rate, a one standard deviation increase per state is associated with an increase in log COVID-19 death rates by some value between \((0.2610, 0.5172)\), offset by the squared term's value between \((-0.1710, -0.0058)\). For Population Density, a one standard deviation increase per state is associated with an increase in log COVID-19 death rates by some value between \((0.3547, 0.8681)\), offset by the squared term's value between \((-0.1963, -0.0029)\). It was stated earlier that the p-value for the T test statistic for the first order term in Underlying Health Conditions was non-significant, and we see that here: its 95% confidence interval contains zero. This leaves two possible interpretations. If the true underlying mean for Underlying Health Conditions is positive, then it is an influential predictor in our model like the other first order terms, eventually overtaken by its negative second order term as the unit sizes grow larger. However, if the true mean is zero, then the only remaining term for this variable is the negative quadratic; as Underlying Health Conditions increases, COVID-19 death rates would then decrease rather than increase. One interpretation of this scenario is that states with higher morbidity rates in these Underlying Health Conditions could show lower COVID-19 death rates because the persons most at risk of dying from COVID-19 have already died from these Underlying Health Conditions.
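Taking the point estimates at face value (and setting aside their uncertainty), each concave pair of terms \(\hat\beta x+\hat\beta'x^2\) peaks at \(x^*=-\hat\beta/(2\hat\beta')\):

\[ x^*_{\text{Case Rate}}=\frac{0.3891}{2(0.0884)}\approx2.2,\qquad x^*_{\text{Pop. Density}}=\frac{0.6114}{2(0.0996)}\approx3.1 \]

so, measured in standard deviations, the fitted effect of each variable is still increasing for all but the most extreme states.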

Now let us check for the difference in \(\beta\) coefficients:

x = model.matrix(~`Underlying Health Conditions` + I(`Underlying Health Conditions`^2) + `Covid Case Rate` + 
     I(`Covid Case Rate`^2) + `Population Density` + I(`Population Density`^2), data=dataFA)
y = log(data$`Covid Death Rate`)
c = solve(t(x)%*%x)
beta = c%*%t(x)%*%y
t = qt(1-0.05/2, nrow(dataFA)-7)
s = summary(modelFa)$sigma
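The code above implements the standard \((1-\alpha)\) confidence interval for a linear combination \(\mathbf{a}^\top\boldsymbol\beta\) of the coefficients:

\[ \mathbf{a}^\top\hat{\boldsymbol\beta}\pm t_{\alpha/2,\,n-k-1}\,s\sqrt{\mathbf{a}^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{a}} \]

where \(s\) is the model's residual standard error and \((\mathbf{X}^\top\mathbf{X})^{-1}\) is the matrix c computed above.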

Difference between Underlying Health Conditions and Covid Case Rate \((\beta_1+\beta_2)-(\beta_3+\beta_4)\):

a = c(0, 1, 1, -1, -1, 0, 0)
l = a%*%beta
temp = t*s*sqrt(t(a)%*%c%*%a)
m=matrix(c(l-temp, l+temp), ncol=2)
rownames(m) = c("b1/2-b3/4")
colnames(m) = c("Lwr", "Uppr")
as.data.frame(m)
##                  Lwr       Uppr
## b1/2-b3/4 -0.5035708 -0.0765703

Here we can see that our 95% confidence interval is entirely negative, implying with confidence that Underlying Health Conditions contributes less to our model than Covid Case Rate, by some amount between \((-0.50, -0.08)\).

Difference between Underlying Health Conditions and Population Density \((\beta_1+\beta_2)-(\beta_5+\beta_6)\):

a = c(0, 1, 1, 0, 0, -1, -1)
l = a%*%beta
temp = t*s*sqrt(t(a)%*%c%*%a)
m=matrix(c(l-temp, l+temp), ncol=2)
rownames(m) = c("b1/2-b5/6")
colnames(m) = c("Lwr", "Uppr")
as.data.frame(m)
##                  Lwr       Uppr
## b1/2-b5/6 -0.9716656 -0.4964539

Here we can see that our 95% confidence interval is entirely negative, implying with confidence that Underlying Health Conditions contributes less to our model than Population Density, by some amount between \((-0.97, -0.50)\).

Difference between Covid Case Rate and Population Density \((\beta_3+\beta_4)-(\beta_5+\beta_6)\):

a = c(0, 0, 0, 1, 1, -1, -1)
l = a%*%beta
temp = t*s*sqrt(t(a)%*%c%*%a)
m=matrix(c(l-temp, l+temp), ncol=2)
rownames(m) = c("b3/4-b5/6")
colnames(m) = c("Lwr", "Uppr")
as.data.frame(m)
##                  Lwr       Uppr
## b3/4-b5/6 -0.6275606 -0.2604179

Here we can see that our 95% confidence interval is entirely negative, implying with confidence that Covid Case Rate contributes less to our model than Population Density, by some amount between \((-0.63, -0.26)\).

In summary, Population Density is the most influential factor in our model, followed by Covid Case Rate, and trailed by Underlying Health Conditions.

7.1.3 Point Estimates

Here we will construct \(95\)% confidence intervals for the COVID-19 death rates estimated from our model’s regression, shown alongside the observed values and the difference between observed and fitted.

ci=exp(predict(modelFa, interval="confidence"))
ci=cbind(orig$`Covid Death Rate`, ci[,1],orig$`Covid Death Rate`-ci[,1], ci[,2], ci[,3])
colnames(ci) = c("Actual", "Fitted", "Difference", "Lower", "Upper")
paged_table(as.data.frame(ci))

7.1.4 Underlying Health Conditions

Underlying Health Conditions contributes the least to our model. With 95% confidence, a one unit increase in this factor changes log COVID-19 death rates per state by some value between \((-0.067, 0.207)\), offset by the squared term's value between \((-0.132, -0.008)\); note that the first interval contains zero, consistent with the non-significant first order term. As stated before, a one unit increase in Underlying Health Conditions is heavily correlated with increases in Diabetes, Kidney Disease, Chronic Lower Respiratory Disease, Heart Disease, and Cancer morbidity rates:

fa$loadings
## 
## Loadings:
##                                        PA1  
## Diabetes Rate                          0.856
## Hypertension Rate                      0.557
## Kidney Disease Rate                    0.974
## Chronic Lower Respiratory Disease Rate 0.837
## Heart Disease Rate                     0.901
## Cancer Rate                            0.974
## 
##                  PA1
## SS loadings    4.451
## Proportion Var 0.742

Where the factor scores are computed from the standardized variables using the following weights:

fa$weights
##                                               PA1
## Diabetes Rate                          0.17491006
## Hypertension Rate                      0.02108163
## Kidney Disease Rate                    0.33223086
## Chronic Lower Respiratory Disease Rate 0.08812859
## Heart Disease Rate                     0.10055472
## Cancer Rate                            0.33222736

As one can see, Kidney Disease and Cancer rates carry the largest weights and are therefore the most influential variables in the factor scores. Thus, as these variables increase, Underlying Health Conditions increases the most in response.
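In other words, each state's factor score is a weighted sum of its standardized morbidity rates:

\[ \hat{f}=\sum_jw_jz_j\approx0.175\,z_{\text{Diab}}+0.021\,z_{\text{Hyp}}+0.332\,z_{\text{Kidney}}+0.088\,z_{\text{CLRD}}+0.101\,z_{\text{Heart}}+0.332\,z_{\text{Cancer}} \]

where \(z_j\) is the z-score of the \(j\)-th variable.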

8 Conclusion

The CDC reports major underlying health conditions and comorbidities for COVID-19 deaths. These morbidity rates were collected independently for each state. Extra variables that intuitively contribute were also added, such as population density, the population of those over 65 years of age, and COVID-19 cases per state. Each of these variables, except Population Density, was scaled by the state's population size to derive rates. A simple first order model was proposed but was eventually reduced to a handful of variables by minimizing AIC values. However, variance inflation factors and the correlation matrix of the reduced variables showed that multicollinearity existed in our data. Variables were removed from the model one by one until the variance inflation factors became acceptable, resulting in the simple two variable model \(y=\beta_0+\beta_1x_1+\beta_2x_2\), relating COVID-19 death rates to Covid Case Rate and Population Density per state. A factor model was then proposed, adding a new variable, Underlying Health Conditions, to the ordinary regression; however, this variable was shown to be non-influential to the model.

After checking assumptions, it was shown via the Shapiro-Wilk test that normality of the residuals did not hold; in response, a natural logarithm transformation was applied to the response variable. A new ordinary model was derived relating COVID-19 death rates to Diabetes, Chronic Lower Respiratory Disease, and Population Density. Meanwhile, our factor model’s latent variable, Underlying Health Conditions, was shown to be influential after the transformation. After trial and error, a second order model for the factor regression was proposed: \[y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+\beta_4x_2^2+\beta_5x_3+\beta_6x_3^2\] Where \(x_1=\text{Underlying Health Conditions}\), \(x_2=\text{Covid Case Rate}\) and \(x_3=\text{Population Density}\). The ordinary regression model had an adjusted R squared value of \(0.5607\) compared to the factor model’s \(0.7411\); therefore, the factor analysis model was used for the rest of the analysis.

Confidence intervals for the \(\beta\) coefficients were calculated, and it was shown with 95% confidence that the variables’ contributions differ: a one unit increase in Population Density increases COVID-19 death rates more than a one unit increase in Covid Case Rate or Underlying Health Conditions.

In conclusion, the data show evidence that states with higher COVID-19 case rates, higher population densities, and higher Underlying Health Conditions have higher COVID-19 death rates than states with lower values, where Underlying Health Conditions is heavily correlated with Diabetes, Hypertension, Kidney Disease, Chronic Lower Respiratory Disease, Heart Disease, and Cancer morbidity rates. It was shown that Influenza morbidity rates, Obesity rates, and the share of the population over the age of 65 had no statistical influence on the model. However, because the p-value of the first order term for Underlying Health Conditions was non-significant, that term may truly be zero, leaving the model with only a negative quadratic term for Underlying Health Conditions; in that case, as Underlying Health Conditions increases, COVID-19 death rates would drop. The intuition behind this scenario may be that states with higher morbidity rates in these Underlying Health Conditions show lower COVID-19 death rates because the persons who would have died from COVID-19 have already died from these Underlying Health Conditions. Note: COVID-19 case and death rates are current as of 10/31/2020.

References

“Daily Updates of Totals by Week and State.” 2020. Available at Centers for Disease Control https://www.cdc.gov/nchs/nvss/vsrr/covid19/index.htm.

“Image Library.” 2020. Available at Centers for Disease Control https://www.cdc.gov/media/subtopic/images.htm.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company.

Mendenhall, William M., and Terry L. Sincich. 2016. Statistics for Engineering and the Sciences. 6th ed. CRC Press.

Mukaka, MM. 2012. “A Guide to Appropriate Use of Correlation Coefficient in Medical Research.” Malawi Medical Journal 24 (3): 69–71.

“Population Density in the U.S. by Federal States Including the District of Columbia in 2019.” 2019. https://www.statista.com/statistics/183588/population-density-in-the-federal-states-of-the-us/.

“Population Distribution by Age.” 2019. https://www.kff.org/other/state-indicator/distribution-by-age.

Sauer, Lauren M. 2020. “What Is Coronavirus?” Available at Johns Hopkins Medicine https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus.

“Stats of the States.” n.d. Available at Centers for Disease Control https://www.cdc.gov/nchs/pressroom/sosmap/nav_us.htm.

“Weekly Updates by Select Demographic and Geographic Characteristics.” 2020. Available at Centers for Disease Control https://www.cdc.gov/nchs/nvss/vsrr/covid_weekly/index.htm.

Shapiro, S. S., and M. B. Wilk. 1965. “An Analysis of Variance Test for Normality (Complete Samples).” Biometrika 52 (3/4): 591–611.